Sarcoidosis scRNA-seq: Quality control and clustering PBMC Samples

Contents:¶

  • Sarcoidosis scRNA-seq:Quality control and clustering PBMC Samples
  • Importing all necessary python packages
  • Meta data of Samples
  • Loading Data
  • Preprocessing
  • Basic Filter
  • Pre QC metrics calculation: Sample wise
  • Post QC metrics cutoffs: Sample wise
  • Normalization: Counts Per Million (CPM)
  • Logarithmize
  • Highly-variable genes filtering
  • Regressing out: total_counts', 'pct_counts_mt','pct_counts_ribo' to all samples
  • Scaling
  • PCA Analysis
  • Saving data
  • Reading dataset
  • Computing the neighborhood graph and clustering
  • Saving data after neighborhood graph and clustering

Sarcoidosis scRNA-seq: Quality control and clustering PBMC Samples ¶

All of peripheral blood mononuclear cells (PBMC) datasets are processed by Pipeline-Version: cellranger-7.1.0

Importing all necessary python packages ¶

In [1]:
# Import necessary libraries

import scanpy as sc #software suite of tools for single-cell analysis in python
import besca as bc #internal BEDA package for single cell analysis
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import numpy as np
import scipy
import anndata as ad
from scipy.sparse import csr_matrix
import scanpy.external as sce
from harmony import harmonize
import umap.umap_ as umap
import os
from scipy import io
print(ad.__version__)

sc.settings.verbosity = 3             # verbosity: errors (0), warnings (1), info (2), hints (3)
INFO:torch.distributed.nn.jit.instantiator:Created a temporary directory at /tmp/tmpunf05skn
INFO:torch.distributed.nn.jit.instantiator:Writing /tmp/tmpunf05skn/_remote_module_non_scriptable.py
INFO:lightning_fabric.utilities.seed:Global seed set to 0
0.9.1

Meta data of Samples ¶

  1. pbmcsarc1: SAM24412250-Sarcoidosis_Donor1_PBMC-male-57yrs-white from Genentech (10x Genomics Chromium v3.1 3’ NovaSeq 6000)

  2. pbmcsarc2: SAM24412252 Sarcoidosis_Donor2_PBMC: male-35yrs-southasisan sequenced by Genentech (10x Genomics Chromium v3.1 3’ NovaSeq 6000)

  3. pbmcsarc3: SAM24412252 Sarcoidosis_Donor3_PBMC: female-60yrs-white sequenced by Genentech (10x Genomics Chromium v3.1 3’ NovaSeq 6000)

  4. pbmchealthy1: SC3_v3_NextGem_DI_CellPlex_Human_PBMC_10K_h1: healthy female-19yrs from 10x Genomics database (10x Genomics Chromium v3.1 3’ NovaSeq 6000). For more info link

  5. pbmchealthy2: 5k_pbmc_v3_nextgem_fastqs_h2 from 10x Genomics database a healthy donor (gender not specified) (10x Genomics Chromium v3.1 3’ NovaSeq 6000). For more info Link

  6. pbmchealthy3: 3p_Citrate_CPT_fastqs_h3: Healthy female from 10x Genomics database (10x Genomics Chromium v3.1 3’ NovaSeq 6000). For more info Link

  7. pbmchealthy4: 10k_PBMC_3p_nextgem_Chromium_X_fastqs_h4: Healthy female-25-30yrs (10x Genomics Chromium v3.1 3’ NovaSeq 6000). For more info Link

Loading Data ¶

In [2]:
# Load 10x Genomics data for the first directory - Disease PBMC dataset1 for sarcoidosis
pbmcsarc1 = sc.read_10x_mtx('/raid02/Data-live/tjana/LIB5455299_SAM24412250/outs/filtered_feature_bc_matrix/', 
                        var_names='gene_symbols', cache=True)

# Load 10x Genomics data for the second directory - Disease PBMC dataset2 for sarcoidosis
pbmcsarc2 = sc.read_10x_mtx('/raid02/Data-live/tjana/LIB5455301_SAM24412252/outs/filtered_feature_bc_matrix/', 
                        var_names='gene_symbols', cache=True)

# Load 10x Genomics data for the third directory - Disease PBMC dataset3 for sarcoidosis
pbmcsarc3 = sc.read_10x_mtx('/raid02/Data-live/tjana/LIB5455303_SAM24412254/outs/filtered_feature_bc_matrix/', 
                        var_names='gene_symbols', cache=True)

# Load 10x Genomics data for the fourth directory - Healthy PBMC Control1 from 10X library
pbmchealthy1 = sc.read_10x_mtx('/raid02/Data-live/tjana/multi/SC3_v3_NextGem_DI_CellPlex_Human_PBMC_10K_h1/outs/per_sample_outs/PBMCs_human_2/count/sample_filtered_feature_bc_matrix/', 
                         var_names='gene_symbols', cache=True)

# Load 10x Genomics data for the fifth directory - Healthy PBMC Control2 from 10X library
pbmchealthy2 = sc.read_10x_mtx('/raid02/Data-live/tjana/5k_pbmc_v3_nextgem_fastqs_h2/outs/filtered_feature_bc_matrix/', 
                         var_names='gene_symbols', cache=True)

# Load 10x Genomics data for the sixth directory - Healthy PBMC Control3 from 10X library
pbmchealthy3 = sc.read_10x_mtx('/raid02/Data-live/tjana/3p_Citrate_CPT_fastqs_h3/outs/filtered_feature_bc_matrix/', 
                         var_names='gene_symbols', cache=True)

# Load 10x Genomics data for the seventh directory - Healthy PBMC Control4 from 10X library
pbmchealthy4 = sc.read_10x_mtx('/raid02/Data-live/tjana/10k_PBMC_3p_nextgem_Chromium_X_fastqs_h4/outs/filtered_feature_bc_matrix/', 
                         var_names='gene_symbols', cache=True)
... reading from cache file cache/raid02-Data-live-tjana-LIB5455299_SAM24412250-outs-filtered_feature_bc_matrix-matrix.h5ad
... reading from cache file cache/raid02-Data-live-tjana-LIB5455301_SAM24412252-outs-filtered_feature_bc_matrix-matrix.h5ad
... reading from cache file cache/raid02-Data-live-tjana-LIB5455303_SAM24412254-outs-filtered_feature_bc_matrix-matrix.h5ad
... reading from cache file cache/raid02-Data-live-tjana-multi-SC3_v3_NextGem_DI_CellPlex_Human_PBMC_10K_h1-outs-per_sample_outs-PBMCs_human_2-count-sample_filtered_feature_bc_matrix-matrix.h5ad
... reading from cache file cache/raid02-Data-live-tjana-5k_pbmc_v3_nextgem_fastqs_h2-outs-filtered_feature_bc_matrix-matrix.h5ad
... reading from cache file cache/raid02-Data-live-tjana-3p_Citrate_CPT_fastqs_h3-outs-filtered_feature_bc_matrix-matrix.h5ad
... reading from cache file cache/raid02-Data-live-tjana-10k_PBMC_3p_nextgem_Chromium_X_fastqs_h4-outs-filtered_feature_bc_matrix-matrix.h5ad

Make variable names unique for each dataset

In [3]:
# Make variable names unique for each dataset

pbmcsarc1.var_names_make_unique()
pbmcsarc2.var_names_make_unique()
pbmcsarc3.var_names_make_unique()
pbmchealthy1.var_names_make_unique()
pbmchealthy2.var_names_make_unique()
pbmchealthy3.var_names_make_unique()
pbmchealthy4.var_names_make_unique()

Adding some metadata for all PBMC samples

In [4]:
# Adding some metadata for all PBMC samples

pbmcsarc1.obs['type']="Sarcoidosis"
pbmcsarc1.obs['sample']="PBMC-Sarc-1"
pbmcsarc2.obs['type']="Sarcoidosis"
pbmcsarc2.obs['sample']="PBMC-Sarc-2"
pbmcsarc3.obs['type']="Sarcoidosis"
pbmcsarc3.obs['sample']="PBMC-Sarc-3"


pbmchealthy1.obs['type']="Healthy"
pbmchealthy1.obs['sample']="PBMC-healthy-1"
pbmchealthy2.obs['type']="Healthy"
pbmchealthy2.obs['sample']="PBMC-healthy-2"
pbmchealthy3.obs['type']="Healthy"
pbmchealthy3.obs['sample']="PBMC-healthy-3"
pbmchealthy4.obs['type']="Healthy"
pbmchealthy4.obs['sample']="PBMC-healthy-4"

Preprocessing ¶

Explore the loaded data before preprocessing for each dataset using a for loop

In [5]:
# Explore the loaded data for each dataset using a for loop

for adata_list in [pbmcsarc1, pbmcsarc2, pbmcsarc3, pbmchealthy1, pbmchealthy2, pbmchealthy3, pbmchealthy4]:
    display(adata_list)
AnnData object with n_obs × n_vars = 7438 × 36601
    obs: 'type', 'sample'
    var: 'gene_ids', 'feature_types'
AnnData object with n_obs × n_vars = 10029 × 36601
    obs: 'type', 'sample'
    var: 'gene_ids', 'feature_types'
AnnData object with n_obs × n_vars = 8754 × 36601
    obs: 'type', 'sample'
    var: 'gene_ids', 'feature_types'
AnnData object with n_obs × n_vars = 6093 × 36601
    obs: 'type', 'sample'
    var: 'gene_ids', 'feature_types'
AnnData object with n_obs × n_vars = 5184 × 36601
    obs: 'type', 'sample'
    var: 'gene_ids', 'feature_types'
AnnData object with n_obs × n_vars = 3958 × 36601
    obs: 'type', 'sample'
    var: 'gene_ids', 'feature_types'
AnnData object with n_obs × n_vars = 11999 × 36601
    obs: 'type', 'sample'
    var: 'gene_ids', 'feature_types'

chromosomeY: male and XIST expression: female computing

In [6]:
# chromosomeY: males and XIST (X-inactive specific transcript):female

def get_biomart_annotations(species, gene_info):
    return sc.queries.biomart_annotations(species, gene_info).set_index("external_gene_name")

def chromosomeY_adjustment_step1(adata, species="hsapiens", gene_info=["ensembl_gene_id", "external_gene_name", "start_position", "end_position", "chromosome_name"]):
    annot = get_biomart_annotations(species, gene_info)
    chrY_genes = adata.var_names.intersection(annot.index[annot.chromosome_name == "Y"])
    return chrY_genes

def calculate_percent_chrY(adata, chrY_genes):
    adata.obs['percent_chrY'] = np.sum(adata[:, chrY_genes].X, axis=1).A1 / np.sum(adata.X, axis=1).A1 * 100

def add_XIST_expression_to_obs(adata):
    adata.obs["XIST-counts"] = adata.X[:, adata.var_names.str.match('XIST')].toarray()


# Example usage

i=0
for adata in [pbmcsarc1, pbmcsarc2, pbmcsarc3, pbmchealthy1, pbmchealthy2, pbmchealthy3, pbmchealthy4]:
    chrY_genes = chromosomeY_adjustment_step1(adata)
    calculate_percent_chrY(adata, chrY_genes)
    add_XIST_expression_to_obs(adata)
    i=i+1

Explore the dataset after chromosomeY: male and XIST expression: female computing

In [7]:
for adata_list in [pbmcsarc1, pbmcsarc2, pbmcsarc3, pbmchealthy1, pbmchealthy2, pbmchealthy3, pbmchealthy4]:
    display(adata_list)
AnnData object with n_obs × n_vars = 7438 × 36601
    obs: 'type', 'sample', 'percent_chrY', 'XIST-counts'
    var: 'gene_ids', 'feature_types'
AnnData object with n_obs × n_vars = 10029 × 36601
    obs: 'type', 'sample', 'percent_chrY', 'XIST-counts'
    var: 'gene_ids', 'feature_types'
AnnData object with n_obs × n_vars = 8754 × 36601
    obs: 'type', 'sample', 'percent_chrY', 'XIST-counts'
    var: 'gene_ids', 'feature_types'
AnnData object with n_obs × n_vars = 6093 × 36601
    obs: 'type', 'sample', 'percent_chrY', 'XIST-counts'
    var: 'gene_ids', 'feature_types'
AnnData object with n_obs × n_vars = 5184 × 36601
    obs: 'type', 'sample', 'percent_chrY', 'XIST-counts'
    var: 'gene_ids', 'feature_types'
AnnData object with n_obs × n_vars = 3958 × 36601
    obs: 'type', 'sample', 'percent_chrY', 'XIST-counts'
    var: 'gene_ids', 'feature_types'
AnnData object with n_obs × n_vars = 11999 × 36601
    obs: 'type', 'sample', 'percent_chrY', 'XIST-counts'
    var: 'gene_ids', 'feature_types'
In [8]:
# Adding some metadata for all PBMC samples

adata = pbmcsarc1.concatenate(pbmcsarc2,pbmcsarc3, pbmchealthy1, pbmchealthy2, pbmchealthy3, pbmchealthy4)
sc.pl.violin(adata, ["XIST-counts", "percent_chrY"], jitter=0.4, groupby = 'sample', rotation= 90)
del(adata)
/home/jana/my-notebook-venv/lib/python3.8/site-packages/anndata/_core/anndata.py:1755: FutureWarning:

The AnnData.concatenate method is deprecated in favour of the anndata.concat function. Please use anndata.concat instead.

See the tutorial for concat at: https://anndata.readthedocs.io/en/latest/concatenation.html

... storing 'type' as categorical
... storing 'sample' as categorical
... storing 'feature_types' as categorical
In [9]:
for adata_list in [pbmcsarc1, pbmcsarc2, pbmcsarc3, pbmchealthy1, pbmchealthy2, pbmchealthy3, pbmchealthy4]:
    display(adata_list)
AnnData object with n_obs × n_vars = 7438 × 36601
    obs: 'type', 'sample', 'percent_chrY', 'XIST-counts'
    var: 'gene_ids', 'feature_types'
AnnData object with n_obs × n_vars = 10029 × 36601
    obs: 'type', 'sample', 'percent_chrY', 'XIST-counts'
    var: 'gene_ids', 'feature_types'
AnnData object with n_obs × n_vars = 8754 × 36601
    obs: 'type', 'sample', 'percent_chrY', 'XIST-counts'
    var: 'gene_ids', 'feature_types'
AnnData object with n_obs × n_vars = 6093 × 36601
    obs: 'type', 'sample', 'percent_chrY', 'XIST-counts'
    var: 'gene_ids', 'feature_types'
AnnData object with n_obs × n_vars = 5184 × 36601
    obs: 'type', 'sample', 'percent_chrY', 'XIST-counts'
    var: 'gene_ids', 'feature_types'
AnnData object with n_obs × n_vars = 3958 × 36601
    obs: 'type', 'sample', 'percent_chrY', 'XIST-counts'
    var: 'gene_ids', 'feature_types'
AnnData object with n_obs × n_vars = 11999 × 36601
    obs: 'type', 'sample', 'percent_chrY', 'XIST-counts'
    var: 'gene_ids', 'feature_types'

Visualize highest expression genes for each dataset

In [10]:
# Visualize highest expression genes for each dataset in separate panels using a for loop
# Explore the loaded data for each dataset using a for loop


adata_list = [pbmcsarc1, pbmcsarc2, pbmcsarc3, pbmchealthy1, pbmchealthy2, pbmchealthy3, pbmchealthy4]

def highest_expr_genes(adata_list, n_top=20):
    for i, adata in enumerate(adata_list):
        sc.pl.highest_expr_genes(adata, n_top=n_top, show=False)
        plt.title(f'sample {i+1}')
        plt.show()
        
        # Tabular form
        df = pd.DataFrame(adata[:, adata.var_names].X.sum(axis=0).A1, index=adata.var_names, columns=['Total Expression'])
        df = df.sort_values(by='Total Expression', ascending=False)[:n_top]
        print(f"Top {n_top} expressed genes in Dataset {i+1}:")
        print(df)

# Example usage:
highest_expr_genes(adata_list, n_top=20)
normalizing counts per cell
    finished (0:00:00)
Top 20 expressed genes in Dataset 1:
         Total Expression
MALAT1          1525700.0
MT-CO1           393140.0
MT-ATP6          378563.0
MT-CO3           375700.0
HBB              334863.0
B2M              317130.0
MT-CO2           308940.0
LYZ              261975.0
EEF1A1           250771.0
RPL13            244283.0
TMSB4X           243103.0
RPLP1            234117.0
MT-ND3           232526.0
FTL              215912.0
TPT1             215054.0
RPL10            212866.0
MT-ND4           207993.0
MT-CYB           206646.0
RPL41            199434.0
RPS12            188200.0
normalizing counts per cell
    finished (0:00:00)
Top 20 expressed genes in Dataset 2:
         Total Expression
MALAT1          1705647.0
MT-CO1           527512.0
B2M              422333.0
MT-ATP6          400251.0
MT-CO3           379719.0
MT-CO2           372997.0
LYZ              365990.0
MT-ND3           318675.0
RPS27            314161.0
TMSB4X           308324.0
EEF1A1           300396.0
RPL13            282789.0
RPLP1            272607.0
RPS29            266164.0
RPL41            250601.0
RPL10            250089.0
S100A9           243398.0
TPT1             241819.0
MT-ND4           215758.0
RPS12            211189.0
normalizing counts per cell
    finished (0:00:00)
Top 20 expressed genes in Dataset 3:
         Total Expression
MALAT1          1125388.0
MT-CO1           341985.0
MT-CO3           315503.0
MT-ATP6          307319.0
MT-CO2           294682.0
B2M              262201.0
LYZ              238854.0
RPL13            222471.0
RPLP1            221918.0
TMSB4X           210556.0
EEF1A1           203646.0
S100A9           201671.0
RPL41            196932.0
HBB              194717.0
RPL10            193964.0
MT-ND3           192795.0
RPS27            186924.0
TPT1             182570.0
MT-ND4           180397.0
FTL              177435.0
normalizing counts per cell
    finished (0:00:00)
Top 20 expressed genes in Dataset 4:
         Total Expression
MALAT1          1845116.0
MT-CO1           669359.0
MT-CO2           487299.0
B2M              467265.0
MT-CO3           460713.0
MT-ATP6          448046.0
EEF1A1           415622.0
RPS27            395078.0
MT-ND3           389160.0
RPL41            360598.0
RPL13            349887.0
TMSB4X           329392.0
RPL10            326027.0
RPS12            320573.0
TPT1             317843.0
S100A9           311171.0
MT-ND4           310510.0
FTL              300407.0
MT-CYB           283225.0
ACTB             273351.0
normalizing counts per cell
    finished (0:00:00)
Top 20 expressed genes in Dataset 5:
         Total Expression
MALAT1          1886234.0
MT-CO1           703721.0
MT-CO2           492846.0
MT-CO3           468851.0
EEF1A1           463455.0
B2M              432464.0
MT-ATP6          380837.0
RPL10            376486.0
RPS12            357208.0
RPL13            346016.0
RPL41            345644.0
MT-ND4           345588.0
RPLP1            344371.0
TPT1             329366.0
TMSB4X           310163.0
RPS27            303331.0
MT-CYB           281887.0
RPL32            264740.0
RPL30            244614.0
ACTB             236056.0
normalizing counts per cell
    finished (0:00:00)
Top 20 expressed genes in Dataset 6:
         Total Expression
MALAT1           867392.0
MT-CO3           359978.0
HBB              355931.0
MT-CO2           349212.0
MT-ATP6          328405.0
MT-CO1           285182.0
B2M              229990.0
MT-CYB           229076.0
EEF1A1           216418.0
MT-ND4           215414.0
RPL10            212588.0
RPS12            206968.0
RPL13            205803.0
TPT1             172752.0
MT-ND1           169921.0
RPL41            168195.0
RPS18            161378.0
RPS27            157109.0
MT-ND3           152245.0
RPLP1            148315.0
normalizing counts per cell
    finished (0:00:00)
Top 20 expressed genes in Dataset 7:
         Total Expression
MALAT1          3711621.0
MT-CO1          1367091.0
B2M             1022907.0
MT-ATP6          991239.0
MT-CO2           954794.0
MT-CO3           896199.0
TMSB4X           860974.0
EEF1A1           820617.0
S100A9           758427.0
TPT1             740911.0
RPL13            722824.0
RPL10            675838.0
MT-ND3           663237.0
RPS12            657113.0
RPL41            643249.0
RPS27            634970.0
FTL              629270.0
MT-ND4           616031.0
MT-CYB           603963.0
LYZ              583248.0

Basic Filter ¶

In [11]:
print ("filtering out genes in less than 3 cells")

for adata_list in [pbmcsarc1, pbmcsarc2, pbmcsarc3, pbmchealthy1, pbmchealthy2, pbmchealthy3, pbmchealthy4]:
    sc.pp.filter_cells(adata_list, min_genes=200)
    sc.pp.filter_genes(adata_list, min_cells=3)
filtering out genes in less than 3 cells
filtered out 381 cells that have less than 200 genes expressed
filtered out 16930 genes that are detected in less than 3 cells
filtered out 87 cells that have less than 200 genes expressed
filtered out 16207 genes that are detected in less than 3 cells
filtered out 192 cells that have less than 200 genes expressed
filtered out 17692 genes that are detected in less than 3 cells
filtered out 16 cells that have less than 200 genes expressed
filtered out 11871 genes that are detected in less than 3 cells
filtered out 41 cells that have less than 200 genes expressed
filtered out 10844 genes that are detected in less than 3 cells
filtered out 142 cells that have less than 200 genes expressed
filtered out 14414 genes that are detected in less than 3 cells
filtered out 32 cells that have less than 200 genes expressed
filtered out 9238 genes that are detected in less than 3 cells

Pre QC metrics calculation: Sample wise ¶

In [12]:
# Identifying mitochondrial genes and ribosomal genes and then calculate QC metrics for each dataset
i=1
for adata_list in [pbmcsarc1, pbmcsarc2, pbmcsarc3, pbmchealthy1, pbmchealthy2, pbmchealthy3, pbmchealthy4]:
    adata_list.var['mt'] = adata_list.var_names.str.startswith('MT-') # mitochondrial genes 'MT-''
    adata_list.var['ribo'] = adata_list.var_names.str.startswith(("RPS","RPL")) # ribosomal genes 'RPS/RPL'
    sc.pp.calculate_qc_metrics(adata_list, qc_vars=['mt', 'ribo'], percent_top=None, log1p=False, inplace=True)
    display ("sample no"+str(i))
    sc.pl.violin(adata_list, ['n_genes_by_counts', 'total_counts', 'pct_counts_mt', 'pct_counts_ribo'],
                 jitter=0.4, multi_panel=True)
    i=i+1
'sample no1'
... storing 'type' as categorical
... storing 'sample' as categorical
... storing 'feature_types' as categorical
'sample no2'
... storing 'type' as categorical
... storing 'sample' as categorical
... storing 'feature_types' as categorical
'sample no3'
... storing 'type' as categorical
... storing 'sample' as categorical
... storing 'feature_types' as categorical
'sample no4'
... storing 'type' as categorical
... storing 'sample' as categorical
... storing 'feature_types' as categorical
'sample no5'
... storing 'type' as categorical
... storing 'sample' as categorical
... storing 'feature_types' as categorical
'sample no6'
... storing 'type' as categorical
... storing 'sample' as categorical
... storing 'feature_types' as categorical
'sample no7'
... storing 'type' as categorical
... storing 'sample' as categorical
... storing 'feature_types' as categorical

Post QC metrics cutoffs: Sample wise ¶

In [13]:
for adata_list in [pbmcsarc1, pbmcsarc2, pbmcsarc3, pbmchealthy1, pbmchealthy2, pbmchealthy3, pbmchealthy4]:
    display(adata_list)
AnnData object with n_obs × n_vars = 7057 × 19671
    obs: 'type', 'sample', 'percent_chrY', 'XIST-counts', 'n_genes', 'n_genes_by_counts', 'total_counts', 'total_counts_mt', 'pct_counts_mt', 'total_counts_ribo', 'pct_counts_ribo'
    var: 'gene_ids', 'feature_types', 'n_cells', 'mt', 'ribo', 'n_cells_by_counts', 'mean_counts', 'pct_dropout_by_counts', 'total_counts'
AnnData object with n_obs × n_vars = 9942 × 20394
    obs: 'type', 'sample', 'percent_chrY', 'XIST-counts', 'n_genes', 'n_genes_by_counts', 'total_counts', 'total_counts_mt', 'pct_counts_mt', 'total_counts_ribo', 'pct_counts_ribo'
    var: 'gene_ids', 'feature_types', 'n_cells', 'mt', 'ribo', 'n_cells_by_counts', 'mean_counts', 'pct_dropout_by_counts', 'total_counts'
AnnData object with n_obs × n_vars = 8562 × 18909
    obs: 'type', 'sample', 'percent_chrY', 'XIST-counts', 'n_genes', 'n_genes_by_counts', 'total_counts', 'total_counts_mt', 'pct_counts_mt', 'total_counts_ribo', 'pct_counts_ribo'
    var: 'gene_ids', 'feature_types', 'n_cells', 'mt', 'ribo', 'n_cells_by_counts', 'mean_counts', 'pct_dropout_by_counts', 'total_counts'
AnnData object with n_obs × n_vars = 6077 × 24730
    obs: 'type', 'sample', 'percent_chrY', 'XIST-counts', 'n_genes', 'n_genes_by_counts', 'total_counts', 'total_counts_mt', 'pct_counts_mt', 'total_counts_ribo', 'pct_counts_ribo'
    var: 'gene_ids', 'feature_types', 'n_cells', 'mt', 'ribo', 'n_cells_by_counts', 'mean_counts', 'pct_dropout_by_counts', 'total_counts'
AnnData object with n_obs × n_vars = 5143 × 25757
    obs: 'type', 'sample', 'percent_chrY', 'XIST-counts', 'n_genes', 'n_genes_by_counts', 'total_counts', 'total_counts_mt', 'pct_counts_mt', 'total_counts_ribo', 'pct_counts_ribo'
    var: 'gene_ids', 'feature_types', 'n_cells', 'mt', 'ribo', 'n_cells_by_counts', 'mean_counts', 'pct_dropout_by_counts', 'total_counts'
AnnData object with n_obs × n_vars = 3816 × 22187
    obs: 'type', 'sample', 'percent_chrY', 'XIST-counts', 'n_genes', 'n_genes_by_counts', 'total_counts', 'total_counts_mt', 'pct_counts_mt', 'total_counts_ribo', 'pct_counts_ribo'
    var: 'gene_ids', 'feature_types', 'n_cells', 'mt', 'ribo', 'n_cells_by_counts', 'mean_counts', 'pct_dropout_by_counts', 'total_counts'
AnnData object with n_obs × n_vars = 11967 × 27363
    obs: 'type', 'sample', 'percent_chrY', 'XIST-counts', 'n_genes', 'n_genes_by_counts', 'total_counts', 'total_counts_mt', 'pct_counts_mt', 'total_counts_ribo', 'pct_counts_ribo'
    var: 'gene_ids', 'feature_types', 'n_cells', 'mt', 'ribo', 'n_cells_by_counts', 'mean_counts', 'pct_dropout_by_counts', 'total_counts'

First Sample: pbmcsarc1 QC metrices cutoff: n_genes_by_counts,total_counts, pct_counts_mt and pct_counts_ribo

In [14]:
# Filter cells based on QC metrics
pbmcsarc1 = pbmcsarc1[pbmcsarc1.obs.n_genes_by_counts < 6000, :]  #The number of genes expressed in the count matrix
pbmcsarc1 = pbmcsarc1[pbmcsarc1.obs.total_counts < 30000, :]  #The total counts per cell
pbmcsarc1 = pbmcsarc1[pbmcsarc1.obs.pct_counts_mt < 19, :] #The percentage of counts in mitochondrial genes
pbmcsarc1 = pbmcsarc1[pbmcsarc1.obs.pct_counts_ribo <60, :] #The percentage of counts in ribosomal genes

Second Sample: pbmcsarc2 QC metrices cutoff: n_genes_by_counts,total_counts, pct_counts_mt and pct_counts_ribo

In [15]:
# Filter cells based on QC metrics
pbmcsarc2 = pbmcsarc2[pbmcsarc2.obs.n_genes_by_counts < 7000, :] #The number of genes expressed in the count matrix
pbmcsarc2 = pbmcsarc2[pbmcsarc2.obs.total_counts < 30000, :] #The total counts per cell
pbmcsarc2 = pbmcsarc2[pbmcsarc2.obs.pct_counts_mt < 15, :] #The percentage of counts in mitochondrial genes
pbmcsarc2= pbmcsarc2[pbmcsarc2.obs.pct_counts_ribo <60, :]  #The percentage of counts in ribosomal genes

Third Sample: pbmcsarc3 QC metrices cutoffs: n_genes_by_counts,total_counts, pct_counts_mt and pct_counts_ribo

In [16]:
# Filter cells based on QC metrics
pbmcsarc3 = pbmcsarc3[pbmcsarc3.obs.n_genes_by_counts < 5000, :] #The number of genes expressed in the count matrix
pbmcsarc3 = pbmcsarc3[pbmcsarc3.obs.total_counts < 15000, :] #The total counts per cell
pbmcsarc3 = pbmcsarc3[pbmcsarc3.obs.pct_counts_mt < 15, :]  #The percentage of counts in mitochondrial genes
pbmcsarc3 = pbmcsarc3[pbmcsarc3.obs.pct_counts_ribo <60, :]  #The percentage of counts in ribosomal genes

Fourth Sample: pbmchealthy1 QC metrices cutoffs: n_genes_by_counts,total_counts, pct_counts_mt and pct_counts_ribo

In [17]:
# Filter cells based on QC metrics
pbmchealthy1 = pbmchealthy1[pbmchealthy1.obs.n_genes_by_counts < 8000, :] #The number of genes expressed in the count matrix
pbmchealthy1 = pbmchealthy1[pbmchealthy1.obs.total_counts < 40000, :] #The total counts per cell
pbmchealthy1 = pbmchealthy1[pbmchealthy1.obs.pct_counts_mt < 15, :] #The percentage of counts in mitochondrial genes
pbmchealthy1 = pbmchealthy1[pbmchealthy1.obs.pct_counts_ribo <50, :] #The percentage of counts in ribosomal gene

Fifth Sample: pbmchealthy2 QC metrices cutoff: n_genes_by_counts,total_counts, pct_counts_mt and pct_counts_ribo

In [18]:
# Filter cells based on QC metrics
pbmchealthy2 = pbmchealthy2[pbmchealthy2.obs.n_genes_by_counts < 8000, :] #The number of genes expressed in the count matrix
pbmchealthy2 = pbmchealthy2[pbmchealthy2.obs.total_counts < 50000, :] #The total counts per cell
pbmchealthy2 = pbmchealthy2[pbmchealthy2.obs.pct_counts_mt < 15, :] #The percentage of counts in mitochondrial genes
pbmchealthy2 = pbmchealthy2[pbmchealthy2.obs.pct_counts_ribo <50, :]  #The percentage of counts in ribosomal genes

Sixth Sample: pbmchealthy3 QC metrices cutoffs: n_genes_by_counts,total_counts, pct_counts_mt and pct_counts_ribo

In [19]:
# Filter cells based on QC metrics
pbmchealthy3 = pbmchealthy3[pbmchealthy3.obs.n_genes_by_counts < 7000, :] #The number of genes expressed in the count matrix
pbmchealthy3 = pbmchealthy3[pbmchealthy3.obs.total_counts < 40000, :] #The total counts per cell
pbmchealthy3 = pbmchealthy3[pbmchealthy3.obs.pct_counts_mt < 15, :] #The percentage of counts in mitochondrial genes
pbmchealthy3 = pbmchealthy3[pbmchealthy3.obs.pct_counts_ribo <50, :] #The percentage of counts in ribosomal genes

Seventh Sample: pbmchealthy4 QC metrices cutoffs: n_genes_by_counts,total_counts, pct_counts_mt and pct_counts_ribo

In [20]:
# Filter cells based on QC metrics
pbmchealthy4 = pbmchealthy4[pbmchealthy4.obs.n_genes_by_counts < 8000, :] #The number of genes expressed in the count matrix
pbmchealthy4 = pbmchealthy4[pbmchealthy4.obs.total_counts < 50000, :] #The total counts per cell
pbmchealthy4 = pbmchealthy4[pbmchealthy4.obs.pct_counts_mt < 15, :] #The percentage of counts in mitochondrial genes
pbmchealthy4 = pbmchealthy4[pbmchealthy4.obs.pct_counts_ribo <50, :] #The percentage of counts in ribosomal genes

PostQC Violin plots sample wise

In [22]:
print("PostQC for First Sample: pbmcsarc1")
sc.pl.violin(pbmcsarc1, ['n_genes_by_counts', 'total_counts', 'pct_counts_mt','pct_counts_ribo'],
                 jitter=0.4, multi_panel=True)

print("PostQC for Second Sample: pbmcsarc2")
sc.pl.violin(pbmcsarc2, ['n_genes_by_counts', 'total_counts', 'pct_counts_mt','pct_counts_ribo'],
                 jitter=0.4, multi_panel=True)

print("PostQC for Third Sample: pbmcsarc3")
sc.pl.violin(pbmcsarc3, ['n_genes_by_counts', 'total_counts', 'pct_counts_mt','pct_counts_ribo'],
                 jitter=0.4, multi_panel=True)

print("PostQC for Fourth Sample: pbmchealthy1")
sc.pl.violin(pbmchealthy1, ['n_genes_by_counts', 'total_counts', 'pct_counts_mt','pct_counts_ribo'],
                 jitter=0.4, multi_panel=True)

print("PostQC for Fifth Sample: pbmchealthy2")
sc.pl.violin(pbmchealthy2, ['n_genes_by_counts', 'total_counts', 'pct_counts_mt','pct_counts_ribo'],
                 jitter=0.4, multi_panel=True)

print("PostQC for sixth Sample: pbmchealthy3")
sc.pl.violin(pbmchealthy3, ['n_genes_by_counts', 'total_counts', 'pct_counts_mt','pct_counts_ribo'],
                 jitter=0.4, multi_panel=True)

print("PostQC for seventh Sample: pbmchealthy4")
sc.pl.violin(pbmchealthy4, ['n_genes_by_counts', 'total_counts', 'pct_counts_mt','pct_counts_ribo'],
                 jitter=0.4, multi_panel=True)
PostQC for First Sample: pbmcsarc1
PostQC for Second Sample: pbmcsarc2
PostQC for Third Sample: pbmcsarc3
PostQC for Fourth Sample: pbmchealthy1
PostQC for Fifth Sample: pbmchealthy2
PostQC for sixth Sample: pbmchealthy3
PostQC for seventh Sample: pbmchealthy4

Normalization: Counts Per Million (CPM) ¶

In [23]:
#each cell by total counts over all genes,

# Assuming adata1 to adata7 are your datasets
adata_list = [pbmcsarc1, pbmcsarc2, pbmcsarc3, pbmchealthy1, pbmchealthy2, pbmchealthy3, pbmchealthy4]

# Normalization each cell by total counts over all genes (library-size correct) the data matrix to 10,000 reads per cell (target_sum=1e4)

for adata in adata_list:
    sc.pp.normalize_total(adata, target_sum=1e4)
normalizing counts per cell
/home/jana/my-notebook-venv/lib/python3.8/site-packages/scanpy/preprocessing/_normalization.py:170: UserWarning:

Received a view of an AnnData. Making a copy.

    finished (0:00:00)
normalizing counts per cell
    finished (0:00:00)
normalizing counts per cell
    finished (0:00:00)
normalizing counts per cell
    finished (0:00:00)
normalizing counts per cell
    finished (0:00:00)
normalizing counts per cell
    finished (0:00:00)
normalizing counts per cell
    finished (0:00:00)

Logarithmize ¶

In [24]:
# Assuming adata1 to adata7 are your datasets
adata_list = [pbmcsarc1, pbmcsarc2, pbmcsarc3, pbmchealthy1, pbmchealthy2, pbmchealthy3, pbmchealthy4]

# Computes X=log(X+1) , where log denotes the natural logarithm
for adata in adata_list:
    sc.pp.log1p(adata)

Highly-variable genes filtering ¶

Identify highly-variable genes

In [25]:
# Assuming adata1 to adata7 are your datasets
adata_list = [pbmcsarc1, pbmcsarc2, pbmcsarc3, pbmchealthy1, pbmchealthy2, pbmchealthy3, pbmchealthy4]

# Apply log1p transformation to each adata
for adata in adata_list:
    sc.pp.highly_variable_genes(adata, min_mean=0.0125, max_mean=3, min_disp=0.5)
extracting highly variable genes
    finished (0:00:00)
--> added
    'highly_variable', boolean vector (adata.var)
    'means', float vector (adata.var)
    'dispersions', float vector (adata.var)
    'dispersions_norm', float vector (adata.var)
extracting highly variable genes
    finished (0:00:01)
--> added
    'highly_variable', boolean vector (adata.var)
    'means', float vector (adata.var)
    'dispersions', float vector (adata.var)
    'dispersions_norm', float vector (adata.var)
extracting highly variable genes
    finished (0:00:00)
--> added
    'highly_variable', boolean vector (adata.var)
    'means', float vector (adata.var)
    'dispersions', float vector (adata.var)
    'dispersions_norm', float vector (adata.var)
extracting highly variable genes
    finished (0:00:01)
--> added
    'highly_variable', boolean vector (adata.var)
    'means', float vector (adata.var)
    'dispersions', float vector (adata.var)
    'dispersions_norm', float vector (adata.var)
extracting highly variable genes
    finished (0:00:01)
--> added
    'highly_variable', boolean vector (adata.var)
    'means', float vector (adata.var)
    'dispersions', float vector (adata.var)
    'dispersions_norm', float vector (adata.var)
extracting highly variable genes
    finished (0:00:00)
--> added
    'highly_variable', boolean vector (adata.var)
    'means', float vector (adata.var)
    'dispersions', float vector (adata.var)
    'dispersions_norm', float vector (adata.var)
extracting highly variable genes
    finished (0:00:02)
--> added
    'highly_variable', boolean vector (adata.var)
    'means', float vector (adata.var)
    'dispersions', float vector (adata.var)
    'dispersions_norm', float vector (adata.var)

Plotting highly-variable genes

In [26]:
# Assuming adata1 to adata7 are your datasets
adata_list = [pbmcsarc1, pbmcsarc2, pbmcsarc3, pbmchealthy1, pbmchealthy2, pbmchealthy3, pbmchealthy4]

# Total-count normalize (library-size correct) the data matrix to 10,000 reads per cell
for adata in adata_list:
    sc.pl.highly_variable_genes(adata)

Set to raw formats to all samples

In [27]:
adata_list = [pbmcsarc1, pbmcsarc2, pbmcsarc3, pbmchealthy1, pbmchealthy2, pbmchealthy3, pbmchealthy4]

# Set raw attribute for each adata
for adata in adata_list:
    adata.raw = adata

highly-variable genes filtering to all samples

In [28]:
adata_list = [pbmcsarc1, pbmcsarc2, pbmcsarc3, pbmchealthy1, pbmchealthy2, pbmchealthy3, pbmchealthy4]
# Set raw attribute for each adata
for adata in adata_list:
    adata = adata[:, adata.var.highly_variable]

Regressing out: total_counts', 'pct_counts_mt','pct_counts_ribo' to all samples ¶

The regression of total counts per cell, along with the percentage of mitochondrial genes and ribosomal genes, is a commonly employed technique that enhances the quality of scRNA-seq data analysis by mitigating confounding factors related to cell quality and technical variability. (PMID: 29752298)

In [29]:
#Regress out effects of total counts per cell and the percentage of mitochondrial genes expressed. 
adata_list = [pbmcsarc1, pbmcsarc2, pbmcsarc3, pbmchealthy1, pbmchealthy2, pbmchealthy3, pbmchealthy4]

# Set raw attribute for each adata
for adata in adata_list:
    sc.pp.regress_out(adata, ['total_counts', 'pct_counts_mt','pct_counts_ribo'])
regressing out ['total_counts', 'pct_counts_mt', 'pct_counts_ribo']
    sparse input is densified and may lead to high memory use
    finished (0:06:59)
regressing out ['total_counts', 'pct_counts_mt', 'pct_counts_ribo']
    sparse input is densified and may lead to high memory use
    finished (0:08:45)
regressing out ['total_counts', 'pct_counts_mt', 'pct_counts_ribo']
    sparse input is densified and may lead to high memory use
    finished (0:07:06)
regressing out ['total_counts', 'pct_counts_mt', 'pct_counts_ribo']
    sparse input is densified and may lead to high memory use
    finished (0:07:56)
regressing out ['total_counts', 'pct_counts_mt', 'pct_counts_ribo']
    sparse input is densified and may lead to high memory use
    finished (0:07:17)
regressing out ['total_counts', 'pct_counts_mt', 'pct_counts_ribo']
    sparse input is densified and may lead to high memory use
    finished (0:05:26)
regressing out ['total_counts', 'pct_counts_mt', 'pct_counts_ribo']
    sparse input is densified and may lead to high memory use
    finished (0:13:45)

Scaling ¶

In [30]:
# Scale each gene to unit variance up to standard deviation 10 to all samples
adata_list = [pbmcsarc1, pbmcsarc2, pbmcsarc3, pbmchealthy1, pbmchealthy2, pbmchealthy3, pbmchealthy4]


for adata in adata_list:
    sc.pp.scale(adata, max_value=10)

PCA Analysis ¶

In [31]:
import copy
import matplotlib.pyplot as plt

# Create a deep copy of adata_list
adata_list = [pbmcsarc1, pbmcsarc2, pbmcsarc3, pbmchealthy1, pbmchealthy2, pbmchealthy3, pbmchealthy4]
adata_list_copy = copy.deepcopy(adata_list)

# Initialize an empty list to store the variance ratios
variance_ratios = []

for n_pcs in range(1, 52):
    temp_variances = []
    for adata_temp in adata_list_copy:  # Iterate over each AnnData object
        adata_temp = adata_temp.copy()  # Create a copy of the AnnData object
        sc.tl.pca(adata_temp, n_comps=n_pcs, svd_solver='arpack')
        temp_variances.append(adata_temp.uns['pca']['variance_ratio'])
    variance_ratios.append(temp_variances)

# Plot the explained variance ratio for each PC
plt.figure(figsize=(10, 6))
for n_pcs in range(1, 52):
    for idx, var_ratio in enumerate(variance_ratios[n_pcs - 1], 1):
        plt.plot(range(1, n_pcs+1), var_ratio, marker='o', label=f'n_pcs={n_pcs}, dataset={idx}')
plt.xlabel('Number of PCs')
plt.ylabel('Explained Variance Ratio')
plt.title('Explained Variance Ratio for Each PC')
plt.legend(bbox_to_anchor=(1.4,0.8))
plt.show()

# Delete temporary objects
del adata_list_copy
computing PCA
    on highly variable genes
    with n_comps=1
    finished (0:00:00)
computing PCA
    on highly variable genes
    with n_comps=1
    finished (0:00:01)
computing PCA
    on highly variable genes
    with n_comps=1
    finished (0:00:00)
computing PCA
    on highly variable genes
    with n_comps=1
    finished (0:00:01)
computing PCA
    on highly variable genes
    with n_comps=1
    finished (0:00:01)
computing PCA
    on highly variable genes
    with n_comps=1
    finished (0:00:00)
computing PCA
    on highly variable genes
    with n_comps=1
    finished (0:00:01)
computing PCA
    on highly variable genes
    with n_comps=2
    finished (0:00:00)
computing PCA
    on highly variable genes
    with n_comps=2
    finished (0:00:00)
computing PCA
    on highly variable genes
    with n_comps=2
    finished (0:00:00)
computing PCA
    on highly variable genes
    with n_comps=2
    finished (0:00:01)
computing PCA
    on highly variable genes
    with n_comps=2
    finished (0:00:00)
computing PCA
    on highly variable genes
    with n_comps=2
    finished (0:00:00)
computing PCA
    on highly variable genes
    with n_comps=2
    finished (0:00:01)
computing PCA
    on highly variable genes
    with n_comps=3
    finished (0:00:00)
computing PCA
    on highly variable genes
    with n_comps=3
    finished (0:00:01)
computing PCA
    on highly variable genes
    with n_comps=3
    finished (0:00:00)
computing PCA
    on highly variable genes
    with n_comps=3
    finished (0:00:00)
computing PCA
    on highly variable genes
    with n_comps=3
    finished (0:00:00)
computing PCA
    on highly variable genes
    with n_comps=3
    finished (0:00:00)
computing PCA
    on highly variable genes
    with n_comps=3
    finished (0:00:01)
computing PCA
    on highly variable genes
    with n_comps=4
    finished (0:00:00)
computing PCA
    on highly variable genes
    with n_comps=4
    finished (0:00:00)
computing PCA
    on highly variable genes
    with n_comps=4
    finished (0:00:00)
computing PCA
    on highly variable genes
    with n_comps=4
    finished (0:00:00)
computing PCA
    on highly variable genes
    with n_comps=4
    finished (0:00:00)
computing PCA
    on highly variable genes
    with n_comps=4
    finished (0:00:00)
computing PCA
    on highly variable genes
    with n_comps=4
    finished (0:00:01)
computing PCA
    on highly variable genes
    with n_comps=5
    finished (0:00:00)
computing PCA
    on highly variable genes
    with n_comps=5
    finished (0:00:01)
computing PCA
    on highly variable genes
    with n_comps=5
    finished (0:00:00)
computing PCA
    on highly variable genes
    with n_comps=5
    finished (0:00:01)
computing PCA
    on highly variable genes
    with n_comps=5
    finished (0:00:01)
computing PCA
    on highly variable genes
    with n_comps=5
    finished (0:00:00)
computing PCA
    on highly variable genes
    with n_comps=5
    finished (0:00:01)
computing PCA
    on highly variable genes
    with n_comps=6
    finished (0:00:00)
computing PCA
    on highly variable genes
    with n_comps=6
    finished (0:00:01)
computing PCA
    on highly variable genes
    with n_comps=6
    finished (0:00:01)
computing PCA
    on highly variable genes
    with n_comps=6
    finished (0:00:00)
computing PCA
    on highly variable genes
    with n_comps=6
    finished (0:00:01)
computing PCA
    on highly variable genes
    with n_comps=6
    finished (0:00:00)
computing PCA
    on highly variable genes
    with n_comps=6
    finished (0:00:01)
computing PCA
    on highly variable genes
    with n_comps=7
    finished (0:00:00)
computing PCA
    on highly variable genes
    with n_comps=7
    finished (0:00:01)
computing PCA
    on highly variable genes
    with n_comps=7
    finished (0:00:00)
computing PCA
    on highly variable genes
    with n_comps=7
    finished (0:00:01)
computing PCA
    on highly variable genes
    with n_comps=7
    finished (0:00:01)
computing PCA
    on highly variable genes
    with n_comps=7
    finished (0:00:00)
computing PCA
    on highly variable genes
    with n_comps=7
    finished (0:00:01)
computing PCA
    on highly variable genes
    with n_comps=8
    finished (0:00:01)
computing PCA
    on highly variable genes
    with n_comps=8
    finished (0:00:01)
computing PCA
    on highly variable genes
    with n_comps=8
    finished (0:00:00)
computing PCA
    on highly variable genes
    with n_comps=8
    finished (0:00:01)
computing PCA
    on highly variable genes
    with n_comps=8
    finished (0:00:01)
computing PCA
    on highly variable genes
    with n_comps=8
    finished (0:00:00)
computing PCA
    on highly variable genes
    with n_comps=8
    finished (0:00:01)
computing PCA
    on highly variable genes
    with n_comps=9
    finished (0:00:00)
computing PCA
    on highly variable genes
    with n_comps=9
    finished (0:00:01)
computing PCA
    on highly variable genes
    with n_comps=9
    finished (0:00:01)
computing PCA
    on highly variable genes
    with n_comps=9
    finished (0:00:01)
computing PCA
    on highly variable genes
    with n_comps=9
    finished (0:00:01)
computing PCA
    on highly variable genes
    with n_comps=9
    finished (0:00:00)
computing PCA
    on highly variable genes
    with n_comps=9
    finished (0:00:01)
computing PCA
    on highly variable genes
    with n_comps=10
    finished (0:00:00)
computing PCA
    on highly variable genes
    with n_comps=10
    finished (0:00:01)
computing PCA
    on highly variable genes
    with n_comps=10
    finished (0:00:00)
computing PCA
    on highly variable genes
    with n_comps=10
    finished (0:00:01)
computing PCA
    on highly variable genes
    with n_comps=10
    finished (0:00:00)
computing PCA
    on highly variable genes
    with n_comps=10
    finished (0:00:00)
computing PCA
    on highly variable genes
    with n_comps=10
    finished (0:00:02)
computing PCA
    on highly variable genes
    with n_comps=11
    finished (0:00:00)
computing PCA
    on highly variable genes
    with n_comps=11
    finished (0:00:01)
computing PCA
    on highly variable genes
    with n_comps=11
    finished (0:00:00)
computing PCA
    on highly variable genes
    with n_comps=11
    finished (0:00:01)
computing PCA
    on highly variable genes
    with n_comps=11
    finished (0:00:01)
computing PCA
    on highly variable genes
    with n_comps=11
    finished (0:00:00)
computing PCA
    on highly variable genes
    with n_comps=11
    finished (0:00:01)
computing PCA
    on highly variable genes
    with n_comps=12
    finished (0:00:00)
computing PCA
    on highly variable genes
    with n_comps=12
    finished (0:00:01)
computing PCA
    on highly variable genes
    with n_comps=12
    finished (0:00:01)
computing PCA
    on highly variable genes
    with n_comps=12
    finished (0:00:01)
computing PCA
    on highly variable genes
    with n_comps=12
    finished (0:00:01)
computing PCA
    on highly variable genes
    with n_comps=12
    finished (0:00:00)
computing PCA
    on highly variable genes
    with n_comps=12
    finished (0:00:02)
computing PCA
    on highly variable genes
    with n_comps=13
    finished (0:00:01)
computing PCA
    on highly variable genes
    with n_comps=13
    finished (0:00:01)
computing PCA
    on highly variable genes
    with n_comps=13
    finished (0:00:01)
computing PCA
    on highly variable genes
    with n_comps=13
    finished (0:00:01)
computing PCA
    on highly variable genes
    with n_comps=13
    finished (0:00:01)
computing PCA
    on highly variable genes
    with n_comps=13
    finished (0:00:00)
computing PCA
    on highly variable genes
    with n_comps=13
    finished (0:00:02)
computing PCA
    on highly variable genes
    with n_comps=14
    finished (0:00:01)
computing PCA
    on highly variable genes
    with n_comps=14
    finished (0:00:01)
computing PCA
    on highly variable genes
    with n_comps=14
    finished (0:00:01)
computing PCA
    on highly variable genes
    with n_comps=14
    finished (0:00:01)
computing PCA
    on highly variable genes
    with n_comps=14
    finished (0:00:01)
computing PCA
    on highly variable genes
    with n_comps=14
    finished (0:00:00)
computing PCA
    on highly variable genes
    with n_comps=14
    finished (0:00:02)
computing PCA
    on highly variable genes
    with n_comps=15
    finished (0:00:01)
computing PCA
    on highly variable genes
    with n_comps=15
    finished (0:00:01)
computing PCA
    on highly variable genes
    with n_comps=15
    finished (0:00:01)
computing PCA
    on highly variable genes
    with n_comps=15
    finished (0:00:01)
computing PCA
    on highly variable genes
    with n_comps=15
    finished (0:00:01)
computing PCA
    on highly variable genes
    with n_comps=15
    finished (0:00:01)
computing PCA
    on highly variable genes
    with n_comps=15
    finished (0:00:02)
computing PCA
    on highly variable genes
    with n_comps=16
    finished (0:00:01)
computing PCA
    on highly variable genes
    with n_comps=16
    finished (0:00:01)
computing PCA
    on highly variable genes
    with n_comps=16
    finished (0:00:01)
computing PCA
    on highly variable genes
    with n_comps=16
    finished (0:00:01)
computing PCA
    on highly variable genes
    with n_comps=16
    finished (0:00:01)
computing PCA
    on highly variable genes
    with n_comps=16
    finished (0:00:01)
computing PCA
    on highly variable genes
    with n_comps=16
    finished (0:00:02)
computing PCA
    on highly variable genes
    with n_comps=17
    finished (0:00:01)
computing PCA
    on highly variable genes
    with n_comps=17
    finished (0:00:01)
computing PCA
    on highly variable genes
    with n_comps=17
    finished (0:00:01)
computing PCA
    on highly variable genes
    with n_comps=17
    finished (0:00:01)
computing PCA
    on highly variable genes
    with n_comps=17
    finished (0:00:01)
computing PCA
    on highly variable genes
    with n_comps=17
    finished (0:00:00)
computing PCA
    on highly variable genes
    with n_comps=17
    finished (0:00:02)
computing PCA
    on highly variable genes
    with n_comps=18
    finished (0:00:01)
computing PCA
    on highly variable genes
    with n_comps=18
    finished (0:00:02)
computing PCA
    on highly variable genes
    with n_comps=18
    finished (0:00:01)
computing PCA
    on highly variable genes
    with n_comps=18
    finished (0:00:01)
computing PCA
    on highly variable genes
    with n_comps=18
    finished (0:00:01)
computing PCA
    on highly variable genes
    with n_comps=18
    finished (0:00:01)
computing PCA
    on highly variable genes
    with n_comps=18
    finished (0:00:02)
computing PCA
    on highly variable genes
    with n_comps=19
    finished (0:00:01)
computing PCA
    on highly variable genes
    with n_comps=19
    finished (0:00:01)
computing PCA
    on highly variable genes
    with n_comps=19
    finished (0:00:01)
computing PCA
    on highly variable genes
    with n_comps=19
    finished (0:00:02)
computing PCA
    on highly variable genes
    with n_comps=19
    finished (0:00:01)
computing PCA
    on highly variable genes
    with n_comps=19
    finished (0:00:01)
computing PCA
    on highly variable genes
    with n_comps=19
    finished (0:00:03)
computing PCA
    on highly variable genes
    with n_comps=20
    finished (0:00:01)
computing PCA
    on highly variable genes
    with n_comps=20
    finished (0:00:02)
computing PCA
    on highly variable genes
    with n_comps=20
    finished (0:00:02)
computing PCA
    on highly variable genes
    with n_comps=20
    finished (0:00:02)
computing PCA
    on highly variable genes
    with n_comps=20
    finished (0:00:02)
computing PCA
    on highly variable genes
    with n_comps=20
    finished (0:00:01)
computing PCA
    on highly variable genes
    with n_comps=20
    finished (0:00:02)
computing PCA
    on highly variable genes
    with n_comps=21
    finished (0:00:01)
computing PCA
    on highly variable genes
    with n_comps=21
    finished (0:00:02)
computing PCA
    on highly variable genes
    with n_comps=21
    finished (0:00:02)
computing PCA
    on highly variable genes
    with n_comps=21
    finished (0:00:01)
computing PCA
    on highly variable genes
    with n_comps=21
    finished (0:00:01)
computing PCA
    on highly variable genes
    with n_comps=21
    finished (0:00:01)
computing PCA
    on highly variable genes
    with n_comps=21
    finished (0:00:03)
computing PCA
    on highly variable genes
    with n_comps=22
    finished (0:00:01)
computing PCA
    on highly variable genes
    with n_comps=22
    finished (0:00:02)
computing PCA
    on highly variable genes
    with n_comps=22
    finished (0:00:02)
computing PCA
    on highly variable genes
    with n_comps=22
    finished (0:00:02)
computing PCA
    on highly variable genes
    with n_comps=22
    finished (0:00:02)
computing PCA
    on highly variable genes
    with n_comps=22
    finished (0:00:01)
computing PCA
    on highly variable genes
    with n_comps=22
    finished (0:00:02)
computing PCA
    on highly variable genes
    with n_comps=23
    finished (0:00:01)
computing PCA
    on highly variable genes
    with n_comps=23
    finished (0:00:02)
computing PCA
    on highly variable genes
    with n_comps=23
    finished (0:00:01)
computing PCA
    on highly variable genes
    with n_comps=23
    finished (0:00:02)
computing PCA
    on highly variable genes
    with n_comps=23
    finished (0:00:02)
computing PCA
    on highly variable genes
    with n_comps=23
    finished (0:00:01)
computing PCA
    on highly variable genes
    with n_comps=23
    finished (0:00:03)
computing PCA
    on highly variable genes
    with n_comps=24
    finished (0:00:01)
computing PCA
    on highly variable genes
    with n_comps=24
    finished (0:00:02)
computing PCA
    on highly variable genes
    with n_comps=24
    finished (0:00:02)
computing PCA
    on highly variable genes
    with n_comps=24
    finished (0:00:02)
computing PCA
    on highly variable genes
    with n_comps=24
    finished (0:00:02)
computing PCA
    on highly variable genes
    with n_comps=24
    finished (0:00:01)
computing PCA
    on highly variable genes
    with n_comps=24
    finished (0:00:03)
computing PCA
    on highly variable genes
    with n_comps=25
    finished (0:00:02)
computing PCA
    on highly variable genes
    with n_comps=25
    finished (0:00:02)
computing PCA
    on highly variable genes
    with n_comps=25
    finished (0:00:02)
computing PCA
    on highly variable genes
    with n_comps=25
    finished (0:00:02)
computing PCA
    on highly variable genes
    with n_comps=25
    finished (0:00:02)
computing PCA
    on highly variable genes
    with n_comps=25
    finished (0:00:01)
computing PCA
    on highly variable genes
    with n_comps=25
    finished (0:00:03)
computing PCA
    on highly variable genes
    with n_comps=26
    finished (0:00:02)
computing PCA
    on highly variable genes
    with n_comps=26
    finished (0:00:02)
computing PCA
    on highly variable genes
    with n_comps=26
    finished (0:00:02)
computing PCA
    on highly variable genes
    with n_comps=26
    finished (0:00:02)
computing PCA
    on highly variable genes
    with n_comps=26
    finished (0:00:02)
computing PCA
    on highly variable genes
    with n_comps=26
    finished (0:00:01)
computing PCA
    on highly variable genes
    with n_comps=26
    finished (0:00:03)
computing PCA
    on highly variable genes
    with n_comps=27
    finished (0:00:02)
computing PCA
    on highly variable genes
    with n_comps=27
    finished (0:00:03)
computing PCA
    on highly variable genes
    with n_comps=27
    finished (0:00:03)
computing PCA
    on highly variable genes
    with n_comps=27
    finished (0:00:02)
computing PCA
    on highly variable genes
    with n_comps=27
    finished (0:00:02)
computing PCA
    on highly variable genes
    with n_comps=27
    finished (0:00:02)
computing PCA
    on highly variable genes
    with n_comps=27
    finished (0:00:03)
computing PCA
    on highly variable genes
    with n_comps=28
    finished (0:00:02)
computing PCA
    on highly variable genes
    with n_comps=28
    finished (0:00:03)
computing PCA
    on highly variable genes
    with n_comps=28
    finished (0:00:02)
computing PCA
    on highly variable genes
    with n_comps=28
    finished (0:00:02)
computing PCA
    on highly variable genes
    with n_comps=28
    finished (0:00:02)
computing PCA
    on highly variable genes
    with n_comps=28
    finished (0:00:02)
computing PCA
    on highly variable genes
    with n_comps=28
    finished (0:00:03)
computing PCA
    on highly variable genes
    with n_comps=29
    finished (0:00:02)
computing PCA
    on highly variable genes
    with n_comps=29
    finished (0:00:03)
computing PCA
    on highly variable genes
    with n_comps=29
    finished (0:00:03)
computing PCA
    on highly variable genes
    with n_comps=29
    finished (0:00:02)
computing PCA
    on highly variable genes
    with n_comps=29
    finished (0:00:02)
computing PCA
    on highly variable genes
    with n_comps=29
    finished (0:00:02)
computing PCA
    on highly variable genes
    with n_comps=29
    finished (0:00:04)
computing PCA
    on highly variable genes
    with n_comps=30
    finished (0:00:03)
computing PCA
    on highly variable genes
    with n_comps=30
    finished (0:00:03)
computing PCA
    on highly variable genes
    with n_comps=30
    finished (0:00:04)
computing PCA
    on highly variable genes
    with n_comps=30
    finished (0:00:02)
computing PCA
    on highly variable genes
    with n_comps=30
    finished (0:00:02)
computing PCA
    on highly variable genes
    with n_comps=30
    finished (0:00:02)
computing PCA
    on highly variable genes
    with n_comps=30
    finished (0:00:03)
computing PCA
    on highly variable genes
    with n_comps=31
    finished (0:00:03)
computing PCA
    on highly variable genes
    with n_comps=31
    finished (0:00:03)
computing PCA
    on highly variable genes
    with n_comps=31
    finished (0:00:03)
computing PCA
    on highly variable genes
    with n_comps=31
    finished (0:00:02)
computing PCA
    on highly variable genes
    with n_comps=31
    finished (0:00:03)
computing PCA
    on highly variable genes
    with n_comps=31
    finished (0:00:02)
computing PCA
    on highly variable genes
    with n_comps=31
    finished (0:00:03)
computing PCA
    on highly variable genes
    with n_comps=32
    finished (0:00:03)
computing PCA
    on highly variable genes
    with n_comps=32
    finished (0:00:03)
computing PCA
    on highly variable genes
    with n_comps=32
    finished (0:00:04)
computing PCA
    on highly variable genes
    with n_comps=32
    finished (0:00:03)
computing PCA
    on highly variable genes
    with n_comps=32
    finished (0:00:03)
computing PCA
    on highly variable genes
    with n_comps=32
    finished (0:00:02)
computing PCA
    on highly variable genes
    with n_comps=32
    finished (0:00:04)
computing PCA
    on highly variable genes
    with n_comps=33
    finished (0:00:03)
computing PCA
    on highly variable genes
    with n_comps=33
    finished (0:00:04)
computing PCA
    on highly variable genes
    with n_comps=33
    finished (0:00:04)
computing PCA
    on highly variable genes
    with n_comps=33
    finished (0:00:03)
computing PCA
    on highly variable genes
    with n_comps=33
    finished (0:00:03)
computing PCA
    on highly variable genes
    with n_comps=33
    finished (0:00:02)
computing PCA
    on highly variable genes
    with n_comps=33
    finished (0:00:03)
computing PCA
    on highly variable genes
    with n_comps=34
    finished (0:00:02)
computing PCA
    on highly variable genes
    with n_comps=34
    finished (0:00:04)
computing PCA
    on highly variable genes
    with n_comps=34
    finished (0:00:03)
computing PCA
    on highly variable genes
    with n_comps=34
    finished (0:00:02)
computing PCA
    on highly variable genes
    with n_comps=34
    finished (0:00:03)
computing PCA
    on highly variable genes
    with n_comps=34
    finished (0:00:02)
computing PCA
    on highly variable genes
    with n_comps=34
    finished (0:00:04)
computing PCA
    on highly variable genes
    with n_comps=35
    finished (0:00:03)
computing PCA
    on highly variable genes
    with n_comps=35
    finished (0:00:05)
computing PCA
    on highly variable genes
    with n_comps=35
    finished (0:00:04)
computing PCA
    on highly variable genes
    with n_comps=35
    finished (0:00:03)
computing PCA
    on highly variable genes
    with n_comps=35
    finished (0:00:03)
computing PCA
    on highly variable genes
    with n_comps=35
    finished (0:00:01)
computing PCA
    on highly variable genes
    with n_comps=35
    finished (0:00:04)
computing PCA
    on highly variable genes
    with n_comps=36
    finished (0:00:03)
computing PCA
    on highly variable genes
    with n_comps=36
    finished (0:00:04)
computing PCA
    on highly variable genes
    with n_comps=36
    finished (0:00:04)
computing PCA
    on highly variable genes
    with n_comps=36
    finished (0:00:03)
computing PCA
    on highly variable genes
    with n_comps=36
    finished (0:00:03)
computing PCA
    on highly variable genes
    with n_comps=36
    finished (0:00:02)
computing PCA
    on highly variable genes
    with n_comps=36
    finished (0:00:04)
computing PCA
    on highly variable genes
    with n_comps=37
    finished (0:00:03)
computing PCA
    on highly variable genes
    with n_comps=37
    finished (0:00:05)
computing PCA
    on highly variable genes
    with n_comps=37
    finished (0:00:04)
computing PCA
    on highly variable genes
    with n_comps=37
    finished (0:00:03)
computing PCA
    on highly variable genes
    with n_comps=37
    finished (0:00:03)
computing PCA
    on highly variable genes
    with n_comps=37
    finished (0:00:02)
computing PCA
    on highly variable genes
    with n_comps=37
    finished (0:00:04)
computing PCA
    on highly variable genes
    with n_comps=38
    finished (0:00:03)
computing PCA
    on highly variable genes
    with n_comps=38
    finished (0:00:04)
computing PCA
    on highly variable genes
    with n_comps=38
    finished (0:00:04)
computing PCA
    on highly variable genes
    with n_comps=38
    finished (0:00:04)
computing PCA
    on highly variable genes
    with n_comps=38
    finished (0:00:03)
computing PCA
    on highly variable genes
    with n_comps=38
    finished (0:00:02)
computing PCA
    on highly variable genes
    with n_comps=38
    finished (0:00:04)
computing PCA
    on highly variable genes
    with n_comps=39
    finished (0:00:04)
computing PCA
    on highly variable genes
    with n_comps=39
    finished (0:00:04)
computing PCA
    on highly variable genes
    with n_comps=39
    finished (0:00:04)
computing PCA
    on highly variable genes
    with n_comps=39
    finished (0:00:03)
computing PCA
    on highly variable genes
    with n_comps=39
    finished (0:00:03)
computing PCA
    on highly variable genes
    with n_comps=39
    finished (0:00:02)
computing PCA
    on highly variable genes
    with n_comps=39
    finished (0:00:05)
computing PCA
    on highly variable genes
    with n_comps=40
    finished (0:00:04)
computing PCA
    on highly variable genes
    with n_comps=40
    finished (0:00:04)
computing PCA
    on highly variable genes
    with n_comps=40
    finished (0:00:04)
computing PCA
    on highly variable genes
    with n_comps=40
    finished (0:00:03)
computing PCA
    on highly variable genes
    with n_comps=40
    finished (0:00:04)
computing PCA
    on highly variable genes
    with n_comps=40
    finished (0:00:02)
computing PCA
    on highly variable genes
    with n_comps=40
    finished (0:00:04)
computing PCA
    on highly variable genes
    with n_comps=41
    finished (0:00:03)
computing PCA
    on highly variable genes
    with n_comps=41
    finished (0:00:05)
computing PCA
    on highly variable genes
    with n_comps=41
    finished (0:00:03)
computing PCA
    on highly variable genes
    with n_comps=41
    finished (0:00:03)
computing PCA
    on highly variable genes
    with n_comps=41
    finished (0:00:03)
computing PCA
    on highly variable genes
    with n_comps=41
    finished (0:00:02)
computing PCA
    on highly variable genes
    with n_comps=41
    finished (0:00:04)
computing PCA
    on highly variable genes
    with n_comps=42
    finished (0:00:03)
computing PCA
    on highly variable genes
    with n_comps=42
    finished (0:00:04)
computing PCA
    on highly variable genes
    with n_comps=42
    finished (0:00:04)
computing PCA
    on highly variable genes
    with n_comps=42
    finished (0:00:04)
computing PCA
    on highly variable genes
    with n_comps=42
    finished (0:00:04)
computing PCA
    on highly variable genes
    with n_comps=42
    finished (0:00:03)
computing PCA
    on highly variable genes
    with n_comps=42
    finished (0:00:04)
computing PCA
    on highly variable genes
    with n_comps=43
    finished (0:00:04)
computing PCA
    on highly variable genes
    with n_comps=43
    finished (0:00:05)
computing PCA
    on highly variable genes
    with n_comps=43
    finished (0:00:04)
computing PCA
    on highly variable genes
    with n_comps=43
    finished (0:00:04)
computing PCA
    on highly variable genes
    with n_comps=43
    finished (0:00:04)
computing PCA
    on highly variable genes
    with n_comps=43
    finished (0:00:02)
computing PCA
    on highly variable genes
    with n_comps=43
    finished (0:00:05)
computing PCA
    on highly variable genes
    with n_comps=44
    finished (0:00:03)
computing PCA
    on highly variable genes
    with n_comps=44
    finished (0:00:05)
computing PCA
    on highly variable genes
    with n_comps=44
    finished (0:00:05)
computing PCA
    on highly variable genes
    with n_comps=44
    finished (0:00:03)
computing PCA
    on highly variable genes
    with n_comps=44
    finished (0:00:05)
computing PCA
    on highly variable genes
    with n_comps=44
    finished (0:00:02)
computing PCA
    on highly variable genes
    with n_comps=44
    finished (0:00:05)
computing PCA
    on highly variable genes
    with n_comps=45
    finished (0:00:03)
computing PCA
    on highly variable genes
    with n_comps=45
    finished (0:00:05)
computing PCA
    on highly variable genes
    with n_comps=45
    finished (0:00:04)
computing PCA
    on highly variable genes
    with n_comps=45
    finished (0:00:03)
computing PCA
    on highly variable genes
    with n_comps=45
    finished (0:00:04)
computing PCA
    on highly variable genes
    with n_comps=45
    finished (0:00:02)
computing PCA
    on highly variable genes
    with n_comps=45
    finished (0:00:06)
computing PCA
    on highly variable genes
    with n_comps=46
    finished (0:00:04)
computing PCA
    on highly variable genes
    with n_comps=46
    finished (0:00:04)
computing PCA
    on highly variable genes
    with n_comps=46
    finished (0:00:05)
computing PCA
    on highly variable genes
    with n_comps=46
    finished (0:00:05)
computing PCA
    on highly variable genes
    with n_comps=46
    finished (0:00:04)
computing PCA
    on highly variable genes
    with n_comps=46
    finished (0:00:03)
computing PCA
    on highly variable genes
    with n_comps=46
    finished (0:00:06)
computing PCA
    on highly variable genes
    with n_comps=47
    finished (0:00:04)
computing PCA
    on highly variable genes
    with n_comps=47
    finished (0:00:06)
computing PCA
    on highly variable genes
    with n_comps=47
    finished (0:00:04)
computing PCA
    on highly variable genes
    with n_comps=47
    finished (0:00:04)
computing PCA
    on highly variable genes
    with n_comps=47
    finished (0:00:05)
computing PCA
    on highly variable genes
    with n_comps=47
    finished (0:00:03)
computing PCA
    on highly variable genes
    with n_comps=47
    finished (0:00:07)
computing PCA
    on highly variable genes
    with n_comps=48
    finished (0:00:04)
computing PCA
    on highly variable genes
    with n_comps=48
    finished (0:00:05)
computing PCA
    on highly variable genes
    with n_comps=48
    finished (0:00:05)
computing PCA
    on highly variable genes
    with n_comps=48
    finished (0:00:05)
computing PCA
    on highly variable genes
    with n_comps=48
    finished (0:00:05)
computing PCA
    on highly variable genes
    with n_comps=48
    finished (0:00:02)
computing PCA
    on highly variable genes
    with n_comps=48
    finished (0:00:06)
computing PCA
    on highly variable genes
    with n_comps=49
    finished (0:00:04)
computing PCA
    on highly variable genes
    with n_comps=49
    finished (0:00:06)
computing PCA
    on highly variable genes
    with n_comps=49
    finished (0:00:05)
computing PCA
    on highly variable genes
    with n_comps=49
    finished (0:00:05)
computing PCA
    on highly variable genes
    with n_comps=49
    finished (0:00:04)
computing PCA
    on highly variable genes
    with n_comps=49
    finished (0:00:03)
computing PCA
    on highly variable genes
    with n_comps=49
    finished (0:00:07)
computing PCA
    on highly variable genes
    with n_comps=50
    finished (0:00:04)
computing PCA
    on highly variable genes
    with n_comps=50
    finished (0:00:07)
computing PCA
    on highly variable genes
    with n_comps=50
    finished (0:00:04)
computing PCA
    on highly variable genes
    with n_comps=50
    finished (0:00:05)
computing PCA
    on highly variable genes
    with n_comps=50
    finished (0:00:05)
computing PCA
    on highly variable genes
    with n_comps=50
    finished (0:00:03)
computing PCA
    on highly variable genes
    with n_comps=50
    finished (0:00:07)
computing PCA
    on highly variable genes
    with n_comps=51
    finished (0:00:03)
computing PCA
    on highly variable genes
    with n_comps=51
    finished (0:00:06)
computing PCA
    on highly variable genes
    with n_comps=51
    finished (0:00:06)
computing PCA
    on highly variable genes
    with n_comps=51
    finished (0:00:05)
computing PCA
    on highly variable genes
    with n_comps=51
    finished (0:00:05)
computing PCA
    on highly variable genes
    with n_comps=51
    finished (0:00:03)
computing PCA
    on highly variable genes
    with n_comps=51
    finished (0:00:07)

PCA analysis of first 20 PCs

In [32]:
adata_list = [pbmcsarc1, pbmcsarc2, pbmcsarc3, pbmchealthy1, pbmchealthy2, pbmchealthy3, pbmchealthy4]

# Set raw attribute for each adata
for adata in adata_list:
    sc.tl.pca(adata, svd_solver='arpack', n_comps=20)
computing PCA
    on highly variable genes
    with n_comps=20
    finished (0:00:01)
computing PCA
    on highly variable genes
    with n_comps=20
    finished (0:00:01)
computing PCA
    on highly variable genes
    with n_comps=20
    finished (0:00:01)
computing PCA
    on highly variable genes
    with n_comps=20
    finished (0:00:01)
computing PCA
    on highly variable genes
    with n_comps=20
    finished (0:00:01)
computing PCA
    on highly variable genes
    with n_comps=20
    finished (0:00:01)
computing PCA
    on highly variable genes
    with n_comps=20
    finished (0:00:02)

Scatter plot generation of some popular markers genes in the PCA coordinates

In [33]:
adata_list = [pbmcsarc1, pbmcsarc2, pbmcsarc3, pbmchealthy1, pbmchealthy2, pbmchealthy3, pbmchealthy4]

# Set raw attribute for each adata
for adata in adata_list:
    sc.pl.pca(adata, color= ['CD14', 'CD79A','CD3D', 'FCER1A','NKG7','CST3'])

#scatter plot generation in the PCA coordinates, with 'CD14', 'CD79A','CD3D', 'FCER1A','NKG7' and 'CST3'

print("CD14: CD14+ Monocytes, CD79A: B cell,  CD3D : CD4+ T cell, FCER1A: CD16+ Monocyte, NKG7: NK cell, CST3: Dendritic cells")
/home/jana/my-notebook-venv/lib/python3.8/site-packages/scanpy/plotting/_tools/scatterplots.py:163: MatplotlibDeprecationWarning:

The get_cmap function was deprecated in Matplotlib 3.7 and will be removed two minor releases later. Use ``matplotlib.colormaps[name]`` or ``matplotlib.colormaps.get_cmap(obj)`` instead.

/home/jana/my-notebook-venv/lib/python3.8/site-packages/scanpy/plotting/_tools/scatterplots.py:163: MatplotlibDeprecationWarning:

The get_cmap function was deprecated in Matplotlib 3.7 and will be removed two minor releases later. Use ``matplotlib.colormaps[name]`` or ``matplotlib.colormaps.get_cmap(obj)`` instead.

/home/jana/my-notebook-venv/lib/python3.8/site-packages/scanpy/plotting/_tools/scatterplots.py:163: MatplotlibDeprecationWarning:

The get_cmap function was deprecated in Matplotlib 3.7 and will be removed two minor releases later. Use ``matplotlib.colormaps[name]`` or ``matplotlib.colormaps.get_cmap(obj)`` instead.

/home/jana/my-notebook-venv/lib/python3.8/site-packages/scanpy/plotting/_tools/scatterplots.py:163: MatplotlibDeprecationWarning:

The get_cmap function was deprecated in Matplotlib 3.7 and will be removed two minor releases later. Use ``matplotlib.colormaps[name]`` or ``matplotlib.colormaps.get_cmap(obj)`` instead.

/home/jana/my-notebook-venv/lib/python3.8/site-packages/scanpy/plotting/_tools/scatterplots.py:163: MatplotlibDeprecationWarning:

The get_cmap function was deprecated in Matplotlib 3.7 and will be removed two minor releases later. Use ``matplotlib.colormaps[name]`` or ``matplotlib.colormaps.get_cmap(obj)`` instead.

/home/jana/my-notebook-venv/lib/python3.8/site-packages/scanpy/plotting/_tools/scatterplots.py:163: MatplotlibDeprecationWarning:

The get_cmap function was deprecated in Matplotlib 3.7 and will be removed two minor releases later. Use ``matplotlib.colormaps[name]`` or ``matplotlib.colormaps.get_cmap(obj)`` instead.

/home/jana/my-notebook-venv/lib/python3.8/site-packages/scanpy/plotting/_tools/scatterplots.py:163: MatplotlibDeprecationWarning:

The get_cmap function was deprecated in Matplotlib 3.7 and will be removed two minor releases later. Use ``matplotlib.colormaps[name]`` or ``matplotlib.colormaps.get_cmap(obj)`` instead.

CD14: CD14+ Monocytes, CD79A: B cell,  CD3D : CD4+ T cell, FCER1A: CD16+ Monocyte, NKG7: NK cell, CST3: Dendritic cells

Saving data ¶

In [34]:
import os
from scipy import io

save_files = [
    '/home/jana/pbmcsarc1.h5ad',
    '/home/jana/pbmcsarc2.h5ad',
    '/home/jana/pbmcsarc3.h5ad',
    '/home/jana/pbmchealth1.h5ad',
    '/home/jana/pbmchealth2.h5ad',
    '/home/jana/pbmchealth3.h5ad',
    '/home/jana/pbmchealth4.h5ad'
]

adata_list = [pbmcsarc1, pbmcsarc2, pbmcsarc3, pbmchealthy1, pbmchealthy2, pbmchealthy3, pbmchealthy4]

# Save each adata to the corresponding file
for adata, save_file in zip(adata_list, save_files):
    adata.write_h5ad(save_file)

Deleting individual datasets to save space

In [35]:
# Deleting individual datasets to save space

del(pbmcsarc1, pbmcsarc2,pbmcsarc3, pbmchealthy1, pbmchealthy2, pbmchealthy3, pbmchealthy4)

Reading dataset ¶

In [36]:
#Reading last saved annoatated data object written in h5ad data format. 
#We used similar adata variable to make similar previous data analysis 

# List of file paths
file_paths = [
    '/home/jana/pbmcsarc1.h5ad',
    '/home/jana/pbmcsarc2.h5ad',
    '/home/jana/pbmcsarc3.h5ad',
    '/home/jana/pbmchealth1.h5ad',
    '/home/jana/pbmchealth2.h5ad',
    '/home/jana/pbmchealth3.h5ad',
    '/home/jana/pbmchealth4.h5ad'
]

# List to store loaded data objects
data_objects = []

# Loop to read h5ad files and store data objects
for file_path in file_paths:
    data_objects.append(sc.read_h5ad(file_path))

# Unpack data objects to individual variables
pbmcsarc1, pbmcsarc2, pbmcsarc3, pbmchealthy1, pbmchealthy2, pbmchealthy3, pbmchealthy4 = data_objects

Displaying all samples in this workspace

In [37]:
adata_list = [pbmcsarc1, pbmcsarc2, pbmcsarc3, pbmchealthy1, pbmchealthy2, pbmchealthy3, pbmchealthy4]
for adata in adata_list:
    print (adata)
AnnData object with n_obs × n_vars = 6962 × 19671
    obs: 'type', 'sample', 'percent_chrY', 'XIST-counts', 'n_genes', 'n_genes_by_counts', 'total_counts', 'total_counts_mt', 'pct_counts_mt', 'total_counts_ribo', 'pct_counts_ribo'
    var: 'gene_ids', 'feature_types', 'n_cells', 'mt', 'ribo', 'n_cells_by_counts', 'mean_counts', 'pct_dropout_by_counts', 'total_counts', 'highly_variable', 'means', 'dispersions', 'dispersions_norm', 'mean', 'std'
    uns: 'hvg', 'log1p', 'pca'
    obsm: 'X_pca'
    varm: 'PCs'
AnnData object with n_obs × n_vars = 9779 × 20394
    obs: 'type', 'sample', 'percent_chrY', 'XIST-counts', 'n_genes', 'n_genes_by_counts', 'total_counts', 'total_counts_mt', 'pct_counts_mt', 'total_counts_ribo', 'pct_counts_ribo'
    var: 'gene_ids', 'feature_types', 'n_cells', 'mt', 'ribo', 'n_cells_by_counts', 'mean_counts', 'pct_dropout_by_counts', 'total_counts', 'highly_variable', 'means', 'dispersions', 'dispersions_norm', 'mean', 'std'
    uns: 'hvg', 'log1p', 'pca'
    obsm: 'X_pca'
    varm: 'PCs'
AnnData object with n_obs × n_vars = 8324 × 18909
    obs: 'type', 'sample', 'percent_chrY', 'XIST-counts', 'n_genes', 'n_genes_by_counts', 'total_counts', 'total_counts_mt', 'pct_counts_mt', 'total_counts_ribo', 'pct_counts_ribo'
    var: 'gene_ids', 'feature_types', 'n_cells', 'mt', 'ribo', 'n_cells_by_counts', 'mean_counts', 'pct_dropout_by_counts', 'total_counts', 'highly_variable', 'means', 'dispersions', 'dispersions_norm', 'mean', 'std'
    uns: 'hvg', 'log1p', 'pca'
    obsm: 'X_pca'
    varm: 'PCs'
AnnData object with n_obs × n_vars = 5921 × 24730
    obs: 'type', 'sample', 'percent_chrY', 'XIST-counts', 'n_genes', 'n_genes_by_counts', 'total_counts', 'total_counts_mt', 'pct_counts_mt', 'total_counts_ribo', 'pct_counts_ribo'
    var: 'gene_ids', 'feature_types', 'n_cells', 'mt', 'ribo', 'n_cells_by_counts', 'mean_counts', 'pct_dropout_by_counts', 'total_counts', 'highly_variable', 'means', 'dispersions', 'dispersions_norm', 'mean', 'std'
    uns: 'hvg', 'log1p', 'pca'
    obsm: 'X_pca'
    varm: 'PCs'
AnnData object with n_obs × n_vars = 4881 × 25757
    obs: 'type', 'sample', 'percent_chrY', 'XIST-counts', 'n_genes', 'n_genes_by_counts', 'total_counts', 'total_counts_mt', 'pct_counts_mt', 'total_counts_ribo', 'pct_counts_ribo'
    var: 'gene_ids', 'feature_types', 'n_cells', 'mt', 'ribo', 'n_cells_by_counts', 'mean_counts', 'pct_dropout_by_counts', 'total_counts', 'highly_variable', 'means', 'dispersions', 'dispersions_norm', 'mean', 'std'
    uns: 'hvg', 'log1p', 'pca'
    obsm: 'X_pca'
    varm: 'PCs'
AnnData object with n_obs × n_vars = 3733 × 22187
    obs: 'type', 'sample', 'percent_chrY', 'XIST-counts', 'n_genes', 'n_genes_by_counts', 'total_counts', 'total_counts_mt', 'pct_counts_mt', 'total_counts_ribo', 'pct_counts_ribo'
    var: 'gene_ids', 'feature_types', 'n_cells', 'mt', 'ribo', 'n_cells_by_counts', 'mean_counts', 'pct_dropout_by_counts', 'total_counts', 'highly_variable', 'means', 'dispersions', 'dispersions_norm', 'mean', 'std'
    uns: 'hvg', 'log1p', 'pca'
    obsm: 'X_pca'
    varm: 'PCs'
AnnData object with n_obs × n_vars = 11808 × 27363
    obs: 'type', 'sample', 'percent_chrY', 'XIST-counts', 'n_genes', 'n_genes_by_counts', 'total_counts', 'total_counts_mt', 'pct_counts_mt', 'total_counts_ribo', 'pct_counts_ribo'
    var: 'gene_ids', 'feature_types', 'n_cells', 'mt', 'ribo', 'n_cells_by_counts', 'mean_counts', 'pct_dropout_by_counts', 'total_counts', 'highly_variable', 'means', 'dispersions', 'dispersions_norm', 'mean', 'std'
    uns: 'hvg', 'log1p', 'pca'
    obsm: 'X_pca'
    varm: 'PCs'

Computing the neighborhood graph and clustering ¶

In [38]:
datasets = [pbmcsarc1, pbmcsarc2, pbmcsarc3, pbmchealthy1, pbmchealthy2, pbmchealthy3, pbmchealthy4]

# Compute UMAP coordinates for each dataset
for dataset in datasets:
    sc.pp.neighbors(dataset, n_neighbors=10, n_pcs=20)
    sc.tl.umap(dataset)
i=1
# Perform Leiden clustering for each dataset at different resolutions
for dataset in datasets:
    sc.tl.leiden(dataset)
    sc.tl.leiden(dataset, key_added="leiden_res0_20", resolution=0.20)
    sc.tl.leiden(dataset, key_added="leiden_res0_40", resolution=0.40)
    sc.tl.leiden(dataset, key_added="leiden_res0_60", resolution=0.60)
    sc.tl.leiden(dataset, key_added="leiden_res0_80", resolution=0.80)
    sc.tl.leiden(dataset, key_added="leiden_res1", resolution=1.0)

# Plot UMAP visualization with different cluster labels
    display ("sample no"+str(i))
    sc.pl.umap(dataset, color=["leiden_res0_20", "leiden_res0_40", "leiden_res0_60", "leiden_res0_80", "leiden_res1"], legend_loc="on data")
    i=i+1
computing neighbors
    using 'X_pca' with n_pcs = 20
    finished: added to `.uns['neighbors']`
    `.obsp['distances']`, distances for each pair of neighbors
    `.obsp['connectivities']`, weighted adjacency matrix (0:00:06)
computing UMAP
    finished: added
    'X_umap', UMAP coordinates (adata.obsm) (0:00:24)
computing neighbors
    using 'X_pca' with n_pcs = 20
/home/jana/my-notebook-venv/lib/python3.8/site-packages/numba/core/typed_passes.py:334: NumbaPerformanceWarning:


The keyword argument 'parallel=True' was specified but no transformation for parallel execution was possible.

To find out why, try turning on parallel diagnostics, see https://numba.readthedocs.io/en/stable/user/parallel.html#diagnostics for help.

File "my-notebook-venv/lib/python3.8/site-packages/umap/rp_tree.py", line 135:
@numba.njit(fastmath=True, nogil=True, parallel=True)
def euclidean_random_projection_split(data, indices, rng_state):
^


/home/jana/my-notebook-venv/lib/python3.8/site-packages/umap/nndescent.py:91: NumbaPerformanceWarning:


The keyword argument 'parallel=True' was specified but no transformation for parallel execution was possible.

To find out why, try turning on parallel diagnostics, see https://numba.readthedocs.io/en/stable/user/parallel.html#diagnostics for help.

File "my-notebook-venv/lib/python3.8/site-packages/umap/utils.py", line 409:
@numba.njit(parallel=True)
def build_candidates(current_graph, n_vertices, n_neighbors, max_candidates, rng_state):
^


/home/jana/my-notebook-venv/lib/python3.8/site-packages/numba/core/typed_passes.py:334: NumbaPerformanceWarning:


The keyword argument 'parallel=True' was specified but no transformation for parallel execution was possible.

To find out why, try turning on parallel diagnostics, see https://numba.readthedocs.io/en/stable/user/parallel.html#diagnostics for help.

File "my-notebook-venv/lib/python3.8/site-packages/umap/nndescent.py", line 47:
    @numba.njit(parallel=True)
    def nn_descent(
    ^


    finished: added to `.uns['neighbors']`
    `.obsp['distances']`, distances for each pair of neighbors
    `.obsp['connectivities']`, weighted adjacency matrix (0:00:12)
computing UMAP
/home/jana/my-notebook-venv/lib/python3.8/site-packages/umap/spectral.py:227: UserWarning:

Embedding a total of 2 separate connected components using meta-embedding (experimental)

    finished: added
    'X_umap', UMAP coordinates (adata.obsm) (0:00:32)
computing neighbors
    using 'X_pca' with n_pcs = 20
/home/jana/my-notebook-venv/lib/python3.8/site-packages/numba/core/typed_passes.py:334: NumbaPerformanceWarning:


The keyword argument 'parallel=True' was specified but no transformation for parallel execution was possible.

To find out why, try turning on parallel diagnostics, see https://numba.readthedocs.io/en/stable/user/parallel.html#diagnostics for help.

File "my-notebook-venv/lib/python3.8/site-packages/umap/nndescent.py", line 47:
    @numba.njit(parallel=True)
    def nn_descent(
    ^


    finished: added to `.uns['neighbors']`
    `.obsp['distances']`, distances for each pair of neighbors
    `.obsp['connectivities']`, weighted adjacency matrix (0:00:03)
computing UMAP
    finished: added
    'X_umap', UMAP coordinates (adata.obsm) (0:00:27)
computing neighbors
    using 'X_pca' with n_pcs = 20
    finished: added to `.uns['neighbors']`
    `.obsp['distances']`, distances for each pair of neighbors
    `.obsp['connectivities']`, weighted adjacency matrix (0:00:01)
computing UMAP
/home/jana/my-notebook-venv/lib/python3.8/site-packages/umap/spectral.py:227: UserWarning:

Embedding a total of 2 separate connected components using meta-embedding (experimental)

    finished: added
    'X_umap', UMAP coordinates (adata.obsm) (0:00:18)
computing neighbors
    using 'X_pca' with n_pcs = 20
    finished: added to `.uns['neighbors']`
    `.obsp['distances']`, distances for each pair of neighbors
    `.obsp['connectivities']`, weighted adjacency matrix (0:00:01)
computing UMAP
/home/jana/my-notebook-venv/lib/python3.8/site-packages/umap/spectral.py:227: UserWarning:

Embedding a total of 3 separate connected components using meta-embedding (experimental)

    finished: added
    'X_umap', UMAP coordinates (adata.obsm) (0:00:15)
computing neighbors
    using 'X_pca' with n_pcs = 20
    finished: added to `.uns['neighbors']`
    `.obsp['distances']`, distances for each pair of neighbors
    `.obsp['connectivities']`, weighted adjacency matrix (0:00:00)
computing UMAP
/home/jana/my-notebook-venv/lib/python3.8/site-packages/umap/spectral.py:227: UserWarning:

Embedding a total of 2 separate connected components using meta-embedding (experimental)

    finished: added
    'X_umap', UMAP coordinates (adata.obsm) (0:00:11)
computing neighbors
    using 'X_pca' with n_pcs = 20
/home/jana/my-notebook-venv/lib/python3.8/site-packages/numba/core/typed_passes.py:334: NumbaPerformanceWarning:


The keyword argument 'parallel=True' was specified but no transformation for parallel execution was possible.

To find out why, try turning on parallel diagnostics, see https://numba.readthedocs.io/en/stable/user/parallel.html#diagnostics for help.

File "my-notebook-venv/lib/python3.8/site-packages/umap/nndescent.py", line 47:
    @numba.njit(parallel=True)
    def nn_descent(
    ^


    finished: added to `.uns['neighbors']`
    `.obsp['distances']`, distances for each pair of neighbors
    `.obsp['connectivities']`, weighted adjacency matrix (0:00:04)
computing UMAP
    finished: added
    'X_umap', UMAP coordinates (adata.obsm) (0:00:14)
running Leiden clustering
    finished: found 23 clusters and added
    'leiden', the cluster labels (adata.obs, categorical) (0:00:00)
running Leiden clustering
    finished: found 11 clusters and added
    'leiden_res0_20', the cluster labels (adata.obs, categorical) (0:00:00)
running Leiden clustering
    finished: found 15 clusters and added
    'leiden_res0_40', the cluster labels (adata.obs, categorical) (0:00:00)
running Leiden clustering
    finished: found 18 clusters and added
    'leiden_res0_60', the cluster labels (adata.obs, categorical) (0:00:00)
running Leiden clustering
    finished: found 21 clusters and added
    'leiden_res0_80', the cluster labels (adata.obs, categorical) (0:00:00)
running Leiden clustering
    finished: found 23 clusters and added
    'leiden_res1', the cluster labels (adata.obs, categorical) (0:00:00)
'sample no1'
/home/jana/my-notebook-venv/lib/python3.8/site-packages/scanpy/plotting/_tools/scatterplots.py:163: MatplotlibDeprecationWarning:

The get_cmap function was deprecated in Matplotlib 3.7 and will be removed two minor releases later. Use ``matplotlib.colormaps[name]`` or ``matplotlib.colormaps.get_cmap(obj)`` instead.

/home/jana/my-notebook-venv/lib/python3.8/site-packages/scanpy/plotting/_tools/scatterplots.py:392: UserWarning:

No data for colormapping provided via 'c'. Parameters 'cmap' will be ignored

/home/jana/my-notebook-venv/lib/python3.8/site-packages/scanpy/plotting/_tools/scatterplots.py:392: UserWarning:

No data for colormapping provided via 'c'. Parameters 'cmap' will be ignored

/home/jana/my-notebook-venv/lib/python3.8/site-packages/scanpy/plotting/_tools/scatterplots.py:392: UserWarning:

No data for colormapping provided via 'c'. Parameters 'cmap' will be ignored

/home/jana/my-notebook-venv/lib/python3.8/site-packages/scanpy/plotting/_tools/scatterplots.py:392: UserWarning:

No data for colormapping provided via 'c'. Parameters 'cmap' will be ignored

/home/jana/my-notebook-venv/lib/python3.8/site-packages/scanpy/plotting/_tools/scatterplots.py:392: UserWarning:

No data for colormapping provided via 'c'. Parameters 'cmap' will be ignored

running Leiden clustering
    finished: found 22 clusters and added
    'leiden', the cluster labels (adata.obs, categorical) (0:00:01)
running Leiden clustering
    finished: found 10 clusters and added
    'leiden_res0_20', the cluster labels (adata.obs, categorical) (0:00:00)
running Leiden clustering
    finished: found 14 clusters and added
    'leiden_res0_40', the cluster labels (adata.obs, categorical) (0:00:00)
running Leiden clustering
    finished: found 16 clusters and added
    'leiden_res0_60', the cluster labels (adata.obs, categorical) (0:00:00)
running Leiden clustering
    finished: found 20 clusters and added
    'leiden_res0_80', the cluster labels (adata.obs, categorical) (0:00:00)
running Leiden clustering
    finished: found 22 clusters and added
    'leiden_res1', the cluster labels (adata.obs, categorical) (0:00:01)
'sample no2'
/home/jana/my-notebook-venv/lib/python3.8/site-packages/scanpy/plotting/_tools/scatterplots.py:163: MatplotlibDeprecationWarning:

The get_cmap function was deprecated in Matplotlib 3.7 and will be removed two minor releases later. Use ``matplotlib.colormaps[name]`` or ``matplotlib.colormaps.get_cmap(obj)`` instead.

/home/jana/my-notebook-venv/lib/python3.8/site-packages/scanpy/plotting/_tools/scatterplots.py:392: UserWarning:

No data for colormapping provided via 'c'. Parameters 'cmap' will be ignored

/home/jana/my-notebook-venv/lib/python3.8/site-packages/scanpy/plotting/_tools/scatterplots.py:392: UserWarning:

No data for colormapping provided via 'c'. Parameters 'cmap' will be ignored

/home/jana/my-notebook-venv/lib/python3.8/site-packages/scanpy/plotting/_tools/scatterplots.py:392: UserWarning:

No data for colormapping provided via 'c'. Parameters 'cmap' will be ignored

/home/jana/my-notebook-venv/lib/python3.8/site-packages/scanpy/plotting/_tools/scatterplots.py:392: UserWarning:

No data for colormapping provided via 'c'. Parameters 'cmap' will be ignored

/home/jana/my-notebook-venv/lib/python3.8/site-packages/scanpy/plotting/_tools/scatterplots.py:392: UserWarning:

No data for colormapping provided via 'c'. Parameters 'cmap' will be ignored

running Leiden clustering
    finished: found 21 clusters and added
    'leiden', the cluster labels (adata.obs, categorical) (0:00:01)
running Leiden clustering
    finished: found 11 clusters and added
    'leiden_res0_20', the cluster labels (adata.obs, categorical) (0:00:00)
running Leiden clustering
    finished: found 12 clusters and added
    'leiden_res0_40', the cluster labels (adata.obs, categorical) (0:00:00)
running Leiden clustering
    finished: found 16 clusters and added
    'leiden_res0_60', the cluster labels (adata.obs, categorical) (0:00:00)
running Leiden clustering
    finished: found 20 clusters and added
    'leiden_res0_80', the cluster labels (adata.obs, categorical) (0:00:00)
running Leiden clustering
    finished: found 21 clusters and added
    'leiden_res1', the cluster labels (adata.obs, categorical) (0:00:00)
'sample no3'
/home/jana/my-notebook-venv/lib/python3.8/site-packages/scanpy/plotting/_tools/scatterplots.py:163: MatplotlibDeprecationWarning:

The get_cmap function was deprecated in Matplotlib 3.7 and will be removed two minor releases later. Use ``matplotlib.colormaps[name]`` or ``matplotlib.colormaps.get_cmap(obj)`` instead.

/home/jana/my-notebook-venv/lib/python3.8/site-packages/scanpy/plotting/_tools/scatterplots.py:392: UserWarning:

No data for colormapping provided via 'c'. Parameters 'cmap' will be ignored

/home/jana/my-notebook-venv/lib/python3.8/site-packages/scanpy/plotting/_tools/scatterplots.py:392: UserWarning:

No data for colormapping provided via 'c'. Parameters 'cmap' will be ignored

/home/jana/my-notebook-venv/lib/python3.8/site-packages/scanpy/plotting/_tools/scatterplots.py:392: UserWarning:

No data for colormapping provided via 'c'. Parameters 'cmap' will be ignored

/home/jana/my-notebook-venv/lib/python3.8/site-packages/scanpy/plotting/_tools/scatterplots.py:392: UserWarning:

No data for colormapping provided via 'c'. Parameters 'cmap' will be ignored

/home/jana/my-notebook-venv/lib/python3.8/site-packages/scanpy/plotting/_tools/scatterplots.py:392: UserWarning:

No data for colormapping provided via 'c'. Parameters 'cmap' will be ignored

running Leiden clustering
    finished: found 22 clusters and added
    'leiden', the cluster labels (adata.obs, categorical) (0:00:00)
running Leiden clustering
    finished: found 9 clusters and added
    'leiden_res0_20', the cluster labels (adata.obs, categorical) (0:00:00)
running Leiden clustering
    finished: found 13 clusters and added
    'leiden_res0_40', the cluster labels (adata.obs, categorical) (0:00:00)
running Leiden clustering
    finished: found 18 clusters and added
    'leiden_res0_60', the cluster labels (adata.obs, categorical) (0:00:00)
running Leiden clustering
    finished: found 22 clusters and added
    'leiden_res0_80', the cluster labels (adata.obs, categorical) (0:00:00)
running Leiden clustering
    finished: found 22 clusters and added
    'leiden_res1', the cluster labels (adata.obs, categorical) (0:00:00)
'sample no4'
/home/jana/my-notebook-venv/lib/python3.8/site-packages/scanpy/plotting/_tools/scatterplots.py:163: MatplotlibDeprecationWarning:

The get_cmap function was deprecated in Matplotlib 3.7 and will be removed two minor releases later. Use ``matplotlib.colormaps[name]`` or ``matplotlib.colormaps.get_cmap(obj)`` instead.

/home/jana/my-notebook-venv/lib/python3.8/site-packages/scanpy/plotting/_tools/scatterplots.py:392: UserWarning:

No data for colormapping provided via 'c'. Parameters 'cmap' will be ignored

/home/jana/my-notebook-venv/lib/python3.8/site-packages/scanpy/plotting/_tools/scatterplots.py:392: UserWarning:

No data for colormapping provided via 'c'. Parameters 'cmap' will be ignored

/home/jana/my-notebook-venv/lib/python3.8/site-packages/scanpy/plotting/_tools/scatterplots.py:392: UserWarning:

No data for colormapping provided via 'c'. Parameters 'cmap' will be ignored

/home/jana/my-notebook-venv/lib/python3.8/site-packages/scanpy/plotting/_tools/scatterplots.py:392: UserWarning:

No data for colormapping provided via 'c'. Parameters 'cmap' will be ignored

/home/jana/my-notebook-venv/lib/python3.8/site-packages/scanpy/plotting/_tools/scatterplots.py:392: UserWarning:

No data for colormapping provided via 'c'. Parameters 'cmap' will be ignored

running Leiden clustering
    finished: found 24 clusters and added
    'leiden', the cluster labels (adata.obs, categorical) (0:00:00)
running Leiden clustering
    finished: found 11 clusters and added
    'leiden_res0_20', the cluster labels (adata.obs, categorical) (0:00:00)
running Leiden clustering
    finished: found 14 clusters and added
    'leiden_res0_40', the cluster labels (adata.obs, categorical) (0:00:00)
running Leiden clustering
    finished: found 17 clusters and added
    'leiden_res0_60', the cluster labels (adata.obs, categorical) (0:00:00)
running Leiden clustering
    finished: found 21 clusters and added
    'leiden_res0_80', the cluster labels (adata.obs, categorical) (0:00:00)
running Leiden clustering
    finished: found 24 clusters and added
    'leiden_res1', the cluster labels (adata.obs, categorical) (0:00:00)
'sample no5'
/home/jana/my-notebook-venv/lib/python3.8/site-packages/scanpy/plotting/_tools/scatterplots.py:163: MatplotlibDeprecationWarning:

The get_cmap function was deprecated in Matplotlib 3.7 and will be removed two minor releases later. Use ``matplotlib.colormaps[name]`` or ``matplotlib.colormaps.get_cmap(obj)`` instead.

/home/jana/my-notebook-venv/lib/python3.8/site-packages/scanpy/plotting/_tools/scatterplots.py:392: UserWarning:

No data for colormapping provided via 'c'. Parameters 'cmap' will be ignored

/home/jana/my-notebook-venv/lib/python3.8/site-packages/scanpy/plotting/_tools/scatterplots.py:392: UserWarning:

No data for colormapping provided via 'c'. Parameters 'cmap' will be ignored

/home/jana/my-notebook-venv/lib/python3.8/site-packages/scanpy/plotting/_tools/scatterplots.py:392: UserWarning:

No data for colormapping provided via 'c'. Parameters 'cmap' will be ignored

/home/jana/my-notebook-venv/lib/python3.8/site-packages/scanpy/plotting/_tools/scatterplots.py:392: UserWarning:

No data for colormapping provided via 'c'. Parameters 'cmap' will be ignored

/home/jana/my-notebook-venv/lib/python3.8/site-packages/scanpy/plotting/_tools/scatterplots.py:392: UserWarning:

No data for colormapping provided via 'c'. Parameters 'cmap' will be ignored

running Leiden clustering
    finished: found 23 clusters and added
    'leiden', the cluster labels (adata.obs, categorical) (0:00:00)
running Leiden clustering
    finished: found 14 clusters and added
    'leiden_res0_20', the cluster labels (adata.obs, categorical) (0:00:00)
running Leiden clustering
    finished: found 15 clusters and added
    'leiden_res0_40', the cluster labels (adata.obs, categorical) (0:00:00)
running Leiden clustering
    finished: found 18 clusters and added
    'leiden_res0_60', the cluster labels (adata.obs, categorical) (0:00:00)
running Leiden clustering
    finished: found 21 clusters and added
    'leiden_res0_80', the cluster labels (adata.obs, categorical) (0:00:00)
running Leiden clustering
    finished: found 23 clusters and added
    'leiden_res1', the cluster labels (adata.obs, categorical) (0:00:00)
'sample no6'
/home/jana/my-notebook-venv/lib/python3.8/site-packages/scanpy/plotting/_tools/scatterplots.py:163: MatplotlibDeprecationWarning:

The get_cmap function was deprecated in Matplotlib 3.7 and will be removed two minor releases later. Use ``matplotlib.colormaps[name]`` or ``matplotlib.colormaps.get_cmap(obj)`` instead.

/home/jana/my-notebook-venv/lib/python3.8/site-packages/scanpy/plotting/_tools/scatterplots.py:392: UserWarning:

No data for colormapping provided via 'c'. Parameters 'cmap' will be ignored

/home/jana/my-notebook-venv/lib/python3.8/site-packages/scanpy/plotting/_tools/scatterplots.py:392: UserWarning:

No data for colormapping provided via 'c'. Parameters 'cmap' will be ignored

/home/jana/my-notebook-venv/lib/python3.8/site-packages/scanpy/plotting/_tools/scatterplots.py:392: UserWarning:

No data for colormapping provided via 'c'. Parameters 'cmap' will be ignored

/home/jana/my-notebook-venv/lib/python3.8/site-packages/scanpy/plotting/_tools/scatterplots.py:392: UserWarning:

No data for colormapping provided via 'c'. Parameters 'cmap' will be ignored

/home/jana/my-notebook-venv/lib/python3.8/site-packages/scanpy/plotting/_tools/scatterplots.py:392: UserWarning:

No data for colormapping provided via 'c'. Parameters 'cmap' will be ignored

running Leiden clustering
    finished: found 27 clusters and added
    'leiden', the cluster labels (adata.obs, categorical) (0:00:01)
running Leiden clustering
    finished: found 13 clusters and added
    'leiden_res0_20', the cluster labels (adata.obs, categorical) (0:00:01)
running Leiden clustering
    finished: found 17 clusters and added
    'leiden_res0_40', the cluster labels (adata.obs, categorical) (0:00:01)
running Leiden clustering
    finished: found 20 clusters and added
    'leiden_res0_60', the cluster labels (adata.obs, categorical) (0:00:01)
running Leiden clustering
    finished: found 24 clusters and added
    'leiden_res0_80', the cluster labels (adata.obs, categorical) (0:00:01)
running Leiden clustering
    finished: found 27 clusters and added
    'leiden_res1', the cluster labels (adata.obs, categorical) (0:00:01)
'sample no7'
/home/jana/my-notebook-venv/lib/python3.8/site-packages/scanpy/plotting/_tools/scatterplots.py:163: MatplotlibDeprecationWarning:

The get_cmap function was deprecated in Matplotlib 3.7 and will be removed two minor releases later. Use ``matplotlib.colormaps[name]`` or ``matplotlib.colormaps.get_cmap(obj)`` instead.

/home/jana/my-notebook-venv/lib/python3.8/site-packages/scanpy/plotting/_tools/scatterplots.py:392: UserWarning:

No data for colormapping provided via 'c'. Parameters 'cmap' will be ignored

/home/jana/my-notebook-venv/lib/python3.8/site-packages/scanpy/plotting/_tools/scatterplots.py:392: UserWarning:

No data for colormapping provided via 'c'. Parameters 'cmap' will be ignored

/home/jana/my-notebook-venv/lib/python3.8/site-packages/scanpy/plotting/_tools/scatterplots.py:392: UserWarning:

No data for colormapping provided via 'c'. Parameters 'cmap' will be ignored

/home/jana/my-notebook-venv/lib/python3.8/site-packages/scanpy/plotting/_tools/scatterplots.py:392: UserWarning:

No data for colormapping provided via 'c'. Parameters 'cmap' will be ignored

/home/jana/my-notebook-venv/lib/python3.8/site-packages/scanpy/plotting/_tools/scatterplots.py:392: UserWarning:

No data for colormapping provided via 'c'. Parameters 'cmap' will be ignored

Displaying all samples in this workspace

In [39]:
adata_list = [pbmcsarc1, pbmcsarc2, pbmcsarc3, pbmchealthy1, pbmchealthy2, pbmchealthy3, pbmchealthy4]
for adata in adata_list:
    print (adata)
AnnData object with n_obs × n_vars = 6962 × 19671
    obs: 'type', 'sample', 'percent_chrY', 'XIST-counts', 'n_genes', 'n_genes_by_counts', 'total_counts', 'total_counts_mt', 'pct_counts_mt', 'total_counts_ribo', 'pct_counts_ribo', 'leiden', 'leiden_res0_20', 'leiden_res0_40', 'leiden_res0_60', 'leiden_res0_80', 'leiden_res1'
    var: 'gene_ids', 'feature_types', 'n_cells', 'mt', 'ribo', 'n_cells_by_counts', 'mean_counts', 'pct_dropout_by_counts', 'total_counts', 'highly_variable', 'means', 'dispersions', 'dispersions_norm', 'mean', 'std'
    uns: 'hvg', 'log1p', 'pca', 'neighbors', 'umap', 'leiden', 'leiden_res0_20_colors', 'leiden_res0_40_colors', 'leiden_res0_60_colors', 'leiden_res0_80_colors', 'leiden_res1_colors'
    obsm: 'X_pca', 'X_umap'
    varm: 'PCs'
    obsp: 'distances', 'connectivities'
AnnData object with n_obs × n_vars = 9779 × 20394
    obs: 'type', 'sample', 'percent_chrY', 'XIST-counts', 'n_genes', 'n_genes_by_counts', 'total_counts', 'total_counts_mt', 'pct_counts_mt', 'total_counts_ribo', 'pct_counts_ribo', 'leiden', 'leiden_res0_20', 'leiden_res0_40', 'leiden_res0_60', 'leiden_res0_80', 'leiden_res1'
    var: 'gene_ids', 'feature_types', 'n_cells', 'mt', 'ribo', 'n_cells_by_counts', 'mean_counts', 'pct_dropout_by_counts', 'total_counts', 'highly_variable', 'means', 'dispersions', 'dispersions_norm', 'mean', 'std'
    uns: 'hvg', 'log1p', 'pca', 'neighbors', 'umap', 'leiden', 'leiden_res0_20_colors', 'leiden_res0_40_colors', 'leiden_res0_60_colors', 'leiden_res0_80_colors', 'leiden_res1_colors'
    obsm: 'X_pca', 'X_umap'
    varm: 'PCs'
    obsp: 'distances', 'connectivities'
AnnData object with n_obs × n_vars = 8324 × 18909
    obs: 'type', 'sample', 'percent_chrY', 'XIST-counts', 'n_genes', 'n_genes_by_counts', 'total_counts', 'total_counts_mt', 'pct_counts_mt', 'total_counts_ribo', 'pct_counts_ribo', 'leiden', 'leiden_res0_20', 'leiden_res0_40', 'leiden_res0_60', 'leiden_res0_80', 'leiden_res1'
    var: 'gene_ids', 'feature_types', 'n_cells', 'mt', 'ribo', 'n_cells_by_counts', 'mean_counts', 'pct_dropout_by_counts', 'total_counts', 'highly_variable', 'means', 'dispersions', 'dispersions_norm', 'mean', 'std'
    uns: 'hvg', 'log1p', 'pca', 'neighbors', 'umap', 'leiden', 'leiden_res0_20_colors', 'leiden_res0_40_colors', 'leiden_res0_60_colors', 'leiden_res0_80_colors', 'leiden_res1_colors'
    obsm: 'X_pca', 'X_umap'
    varm: 'PCs'
    obsp: 'distances', 'connectivities'
AnnData object with n_obs × n_vars = 5921 × 24730
    obs: 'type', 'sample', 'percent_chrY', 'XIST-counts', 'n_genes', 'n_genes_by_counts', 'total_counts', 'total_counts_mt', 'pct_counts_mt', 'total_counts_ribo', 'pct_counts_ribo', 'leiden', 'leiden_res0_20', 'leiden_res0_40', 'leiden_res0_60', 'leiden_res0_80', 'leiden_res1'
    var: 'gene_ids', 'feature_types', 'n_cells', 'mt', 'ribo', 'n_cells_by_counts', 'mean_counts', 'pct_dropout_by_counts', 'total_counts', 'highly_variable', 'means', 'dispersions', 'dispersions_norm', 'mean', 'std'
    uns: 'hvg', 'log1p', 'pca', 'neighbors', 'umap', 'leiden', 'leiden_res0_20_colors', 'leiden_res0_40_colors', 'leiden_res0_60_colors', 'leiden_res0_80_colors', 'leiden_res1_colors'
    obsm: 'X_pca', 'X_umap'
    varm: 'PCs'
    obsp: 'distances', 'connectivities'
AnnData object with n_obs × n_vars = 4881 × 25757
    obs: 'type', 'sample', 'percent_chrY', 'XIST-counts', 'n_genes', 'n_genes_by_counts', 'total_counts', 'total_counts_mt', 'pct_counts_mt', 'total_counts_ribo', 'pct_counts_ribo', 'leiden', 'leiden_res0_20', 'leiden_res0_40', 'leiden_res0_60', 'leiden_res0_80', 'leiden_res1'
    var: 'gene_ids', 'feature_types', 'n_cells', 'mt', 'ribo', 'n_cells_by_counts', 'mean_counts', 'pct_dropout_by_counts', 'total_counts', 'highly_variable', 'means', 'dispersions', 'dispersions_norm', 'mean', 'std'
    uns: 'hvg', 'log1p', 'pca', 'neighbors', 'umap', 'leiden', 'leiden_res0_20_colors', 'leiden_res0_40_colors', 'leiden_res0_60_colors', 'leiden_res0_80_colors', 'leiden_res1_colors'
    obsm: 'X_pca', 'X_umap'
    varm: 'PCs'
    obsp: 'distances', 'connectivities'
AnnData object with n_obs × n_vars = 3733 × 22187
    obs: 'type', 'sample', 'percent_chrY', 'XIST-counts', 'n_genes', 'n_genes_by_counts', 'total_counts', 'total_counts_mt', 'pct_counts_mt', 'total_counts_ribo', 'pct_counts_ribo', 'leiden', 'leiden_res0_20', 'leiden_res0_40', 'leiden_res0_60', 'leiden_res0_80', 'leiden_res1'
    var: 'gene_ids', 'feature_types', 'n_cells', 'mt', 'ribo', 'n_cells_by_counts', 'mean_counts', 'pct_dropout_by_counts', 'total_counts', 'highly_variable', 'means', 'dispersions', 'dispersions_norm', 'mean', 'std'
    uns: 'hvg', 'log1p', 'pca', 'neighbors', 'umap', 'leiden', 'leiden_res0_20_colors', 'leiden_res0_40_colors', 'leiden_res0_60_colors', 'leiden_res0_80_colors', 'leiden_res1_colors'
    obsm: 'X_pca', 'X_umap'
    varm: 'PCs'
    obsp: 'distances', 'connectivities'
AnnData object with n_obs × n_vars = 11808 × 27363
    obs: 'type', 'sample', 'percent_chrY', 'XIST-counts', 'n_genes', 'n_genes_by_counts', 'total_counts', 'total_counts_mt', 'pct_counts_mt', 'total_counts_ribo', 'pct_counts_ribo', 'leiden', 'leiden_res0_20', 'leiden_res0_40', 'leiden_res0_60', 'leiden_res0_80', 'leiden_res1'
    var: 'gene_ids', 'feature_types', 'n_cells', 'mt', 'ribo', 'n_cells_by_counts', 'mean_counts', 'pct_dropout_by_counts', 'total_counts', 'highly_variable', 'means', 'dispersions', 'dispersions_norm', 'mean', 'std'
    uns: 'hvg', 'log1p', 'pca', 'neighbors', 'umap', 'leiden', 'leiden_res0_20_colors', 'leiden_res0_40_colors', 'leiden_res0_60_colors', 'leiden_res0_80_colors', 'leiden_res1_colors'
    obsm: 'X_pca', 'X_umap'
    varm: 'PCs'
    obsp: 'distances', 'connectivities'

Visualize highest expression genes

In [42]:
# Visualize highest expression genes for each dataset in separate panels using a for loop
# Explore the loaded data for each dataset using a for loop


adata_list = [pbmcsarc1, pbmcsarc2, pbmcsarc3, pbmchealthy1, pbmchealthy2, pbmchealthy3, pbmchealthy4]

def highest_expr_genes(adata_list, n_top=20):
    for i, adata in enumerate(adata_list):
        sc.pl.highest_expr_genes(adata, n_top=n_top, show=False)
        plt.title(f'sample {i+1}')
        plt.show()
        

# Example usage:
highest_expr_genes(adata_list, n_top=20)
normalizing counts per cell
/home/jana/my-notebook-venv/lib/python3.8/site-packages/scanpy/preprocessing/_normalization.py:197: UserWarning:

Some cells have zero counts

    finished (0:00:00)
normalizing counts per cell
/home/jana/my-notebook-venv/lib/python3.8/site-packages/scanpy/preprocessing/_normalization.py:197: UserWarning:

Some cells have zero counts

    finished (0:00:00)
normalizing counts per cell
/home/jana/my-notebook-venv/lib/python3.8/site-packages/scanpy/preprocessing/_normalization.py:197: UserWarning:

Some cells have zero counts

    finished (0:00:00)
normalizing counts per cell
/home/jana/my-notebook-venv/lib/python3.8/site-packages/scanpy/preprocessing/_normalization.py:197: UserWarning:

Some cells have zero counts

    finished (0:00:00)
normalizing counts per cell
/home/jana/my-notebook-venv/lib/python3.8/site-packages/scanpy/preprocessing/_normalization.py:197: UserWarning:

Some cells have zero counts

    finished (0:00:00)
normalizing counts per cell
/home/jana/my-notebook-venv/lib/python3.8/site-packages/scanpy/preprocessing/_normalization.py:197: UserWarning:

Some cells have zero counts

    finished (0:00:00)
normalizing counts per cell
/home/jana/my-notebook-venv/lib/python3.8/site-packages/scanpy/preprocessing/_normalization.py:197: UserWarning:

Some cells have zero counts

    finished (0:00:01)

Saving data after neighborhood graph and clustering ¶

In [43]:
save_files = [
    '/home/jana/pbmcsarc1.h5ad',
    '/home/jana/pbmcsarc2.h5ad',
    '/home/jana/pbmcsarc3.h5ad',
    '/home/jana/pbmchealth1.h5ad',
    '/home/jana/pbmchealth2.h5ad',
    '/home/jana/pbmchealth3.h5ad',
    '/home/jana/pbmchealth4.h5ad'
]

adata_list = [pbmcsarc1, pbmcsarc2, pbmcsarc3, pbmchealthy1, pbmchealthy2, pbmchealthy3, pbmchealthy4]

# Save each adata to the corresponding file
for adata, save_file in zip(adata_list, save_files):
    adata.write_h5ad(save_file)

Deleting individual datasets to save space

In [44]:
# Deleting individual datasets to save space

del(pbmcsarc1, pbmcsarc2,pbmcsarc3, pbmchealthy1, pbmchealthy2, pbmchealthy3, pbmchealthy4)
In [ ]: